Abstract

We study the problem of (agnostically, efficiently and improperly) learning halfspaces
with margin $\gamma$. Let $\mathcal{D}$ be a distribution over labeled examples. The $\gamma$-margin
error of a hyperplane $h$ is the probability of an example to fall on the wrong side of $h$
or at a distance of at most $\gamma$ from it. The error of the best $h$ is denoted $\mathrm{Err}_\gamma(\mathcal{D})$. An
efficient $\alpha(\gamma)$-approximation algorithm receives $\gamma$ as input and, in time $\mathrm{poly}(1/\gamma)$, using
i.i.d. samples of $\mathcal{D}$, outputs a classifier with error rate $\le \alpha(\gamma)\,\mathrm{Err}_\gamma(\mathcal{D})$.

A popular family of such algorithms (e.g. kernel SVM and kernel ridge regression)
operates by optimizing a convex loss over an $L_2$-regularized linear class of functions. We
show that, for every kernel and every convex loss, the approximation ratio of such an
algorithm must be $\ge \Omega\left(\frac{1/\gamma}{\mathrm{poly}(\log(1/\gamma))}\right)$. The result is tight, up to a poly-log factor.

1 Introduction 

Support Vector Machine (SVM) is one of the most popular learning algorithms for the 
fundamental task of learning large margin halfspaces. SVM has been widely used in practice 
and its statistical efficiency has been thoroughly studied (Vapnik, 1998, Anthony and Bartlett,
1999, Schölkopf et al., 1998, Cristianini and Shawe-Taylor, 2000, Steinwart and Christmann,
2008). The basic variant of SVM, called Hard-SVM, assumes that the data is separable
with a margin $\gamma$. That is, there exists a separating hyperplane that not only separates
positive examples from negative examples but also has the property that all examples are
at a distance of at least $\gamma$ from it. Hard-SVM seeks a halfspace that separates the training
examples with the largest possible margin. Under the assumption of separability with margin,
Hard-SVM is guaranteed to find a classifier whose classification error is at most $\epsilon$ using
time and sample complexity of $\mathrm{poly}(1/\gamma, 1/\epsilon)$. However, this assumption is quite strong and
rarely holds in practice. It is therefore natural to relax the problem and require instead that
the algorithm learns a classifier whose error is comparable with that of the optimal $\gamma$-margin
separating hyperplane.

The second and most popular variant of SVM, called Soft-SVM, tackles the non-separable
case. It solves a convex optimization problem by using the so-called hinge-loss function as
a surrogate convex loss function for the number of margin violations. Replacing the number
of margin violations with a surrogate convex loss is a heuristic. It is easy to see that this
heuristic leads to an approximation ratio of $\Theta(1/\gamma)$ (e.g. (Ben-David et al., 2012)). But
this lower bound is only known to hold when applying a surrogate convex loss function
on the original representation of the examples. However, one of the most crucial reasons
for the success of the SVM paradigm is the celebrated kernel trick. Kernels allow us to
dramatically extend the repertoire of classification functions we can learn, by embedding the
data in a high dimensional space. For example, using a polynomial kernel, we can learn all
polynomials of bounded degree. This considerably extends the class of affine functions used
by the "straight" Soft-SVM. The goal of this paper is to derive strong lower bounds on the
achievable approximation ratio of learning large margin halfspaces by applying SVM with an
arbitrary kernel function. In addition, we derive a lower bound on the approximation ratio
of any method that involves embedding the examples into a feature space of dimension
polynomial in $1/\gamma$ and then learning a linear classifier in this feature space by optimizing a
convex surrogate.

1.1 Learning large margin halfspaces 

Let $B$ be the unit ball of some Hilbert space¹ $H$. A halfspace, parameterized by $w \in B$ and
$b \in \mathbb{R}$, is the classifier $f(x) = \mathrm{sign}(\Lambda_{w,b}(x))$, where $\Lambda_{w,b}(x) = \langle w, x \rangle + b$ is an affine functional.
Given a distribution $\mathcal{D}$ over $B \times \{\pm 1\}$, the error rate of $\Lambda_{w,b}$ is

$$\mathrm{Err}_{\mathcal{D},0-1}(\Lambda_{w,b}) = \Pr_{(x,y)\sim\mathcal{D}}\left(\mathrm{sign}(\Lambda_{w,b}(x)) \ne y\right) = \Pr_{(x,y)\sim\mathcal{D}}\left(y\Lambda_{w,b}(x) \le 0\right).$$

More generally, we define the error rate of $f : B \to \mathbb{R}$ as $\mathrm{Err}_{\mathcal{D},0-1}(f) = \Pr_{(x,y)\sim\mathcal{D}}(yf(x) \le 0)$.
The $\gamma$-margin error rate of $f$ is

$$\mathrm{Err}_{\mathcal{D},\gamma}(f) = \Pr_{(x,y)\sim\mathcal{D}}\left(yf(x) \le \gamma\right).$$

Note that for a halfspace classifier $\Lambda_{w,b}$, if $\|w\| = 1$ then $|\Lambda_{w,b}(x)|$ is the distance of $x$
from the separating hyperplane. Therefore, the $\gamma$-margin error rate is the probability of
$x$ to either be on the wrong side of the hyperplane or to be at a distance of at most $\gamma$
from the hyperplane. The least $\gamma$-margin error rate of a halfspace classifier is denoted
$\mathrm{Err}_\gamma(\mathcal{D}) = \min_{w \in B,\, b \in \mathbb{R}} \mathrm{Err}_{\mathcal{D},\gamma}(\Lambda_{w,b})$.
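
As an illustration of these definitions, the following minimal sketch computes the empirical 0-1 error and the empirical $\gamma$-margin error of a halfspace on a small sample. The toy data, the function names and the choice of treating a margin of exactly $\gamma$ as an error are assumptions made for the example only.

```python
def affine(w, b, x):
    """Value of the affine functional <w, x> + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def zero_one_error(w, b, sample):
    """Fraction of examples with y * (<w, x> + b) <= 0."""
    return sum(y * affine(w, b, x) <= 0 for x, y in sample) / len(sample)

def margin_error(w, b, sample, gamma):
    """Fraction of examples with y * (<w, x> + b) <= gamma."""
    return sum(y * affine(w, b, x) <= gamma for x, y in sample) / len(sample)

# Hypothetical toy sample in the unit ball of R^2 (illustration only).
sample = [((0.5, 0.0), 1), ((0.05, 0.3), 1), ((-0.4, 0.1), -1), ((0.2, -0.2), -1)]
w, b, gamma = (1.0, 0.0), 0.0, 0.1

err01 = zero_one_error(w, b, sample)
err_gamma = margin_error(w, b, sample, gamma)
print(err01, err_gamma)
```

The second example is classified correctly but lies within distance $\gamma$ of the hyperplane, so it counts toward the margin error only; the margin error is therefore never smaller than the 0-1 error.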

A learning algorithm receives $\gamma, \epsilon$ and can receive i.i.d. samples from $\mathcal{D}$. The algorithm
should return a classifier (which need not be an affine function). We say that the algorithm
has approximation ratio $\alpha(\gamma)$ if for every distribution, it outputs (w.h.p. over the i.i.d.
$\mathcal{D}$-samples) a classifier with error rate $\le \alpha(\gamma)\,\mathrm{Err}_\gamma(\mathcal{D}) + \epsilon$. An efficient algorithm uses
$\mathrm{poly}(1/\gamma, 1/\epsilon)$ time and samples, and the output classifier can be evaluated in $\mathrm{poly}(1/\gamma, 1/\epsilon)$
time.

¹ For concreteness, we assume that the Hilbert space is $\ell_2$. We will later on consider finite dimensional
spaces by viewing $\mathbb{R}^d$ as embedded in $\ell_2$ by the standard embedding.

It is well known that for data which is separable with margin $\gamma$, i.e. $\mathrm{Err}_\gamma(\mathcal{D}) = 0$, the
Support Vector Machine (SVM) algorithm finds a classifier with error $\le \epsilon$ with time and
sample complexity $\le \mathrm{poly}(1/\gamma, 1/\epsilon)$. The problem of proper² learning of halfspaces in the
non-separable case without the margin assumption was shown to be hard to approximate
within any constant approximation factor (Feldman et al., 2006, Guruswami and Raghavendra,
2006). Proper learning with the margin assumption was also shown to be hard (Ben-David
and Simon, 2000). It has been recently shown that improper learning under the margin
assumption is also hard (under some cryptographic assumptions), namely, no polynomial
time algorithm can achieve an approximation ratio of $\alpha(\gamma) = 1$. As we note next, this is a
far cry from what currently known algorithms can accomplish.

The best currently known efficient algorithms (Birnbaum and Shalev-Shwartz, 2012) and
(Long and Servedio, 2011) achieve an approximation ratio of $\frac{1}{\gamma\sqrt{\log(1/\gamma)}}$. The algorithm of
(Birnbaum and Shalev-Shwartz, 2012) applies Soft-SVM with a particular kernel. Several
other authors use similar techniques: mapping the examples into some feature space and
learning a halfspace there using a surrogate convex loss. For example, (Kalai et al., 2005,
Blais et al., 2008) developed a PTAS for agnostically learning halfspaces under several
distributional assumptions based on a polynomial transformation. This relies on embeddings
that turn affine classifiers into polynomials of a certain degree. Our lower bounds apply to
these polynomial spaces and to every finite dimensional space of classification functions.

The learning theory literature contains consistency results for learning with the so-called 
universal kernels and well-calibrated surrogate loss functions. This includes the study of 
asymptotic relations between surrogate convex loss functions and the 0-1 loss function 
(Zhang, 2004, Bartlett et al., 2006, Steinwart and Christmann, 2008). It is shown that 
the approximation ratio of SVM with a universal kernel goes to 1 as the sample size grows.
Our result implies that this convergence is very slow, e.g., an exponentially large (in $1/\gamma$)
sample is needed to make the error $\le 2\,\mathrm{Err}_\gamma(\mathcal{D})$.

Before stating our results, let us recall the kernel SVM approach as well as other feature 
mapping based approaches that do not depend on the kernel trick. 

1.2 Kernel based learning and kernel SVM 

The significance of separation with margin is a fundamental discovery of statistical learning 
theory. The affine function that minimizes the empirical $\gamma$-margin error rate over an i.i.d.
sample of size $\mathrm{poly}(1/\gamma, 1/\epsilon)$ has error rate $\le \mathrm{Err}_\gamma(\mathcal{D}) + \epsilon$. However, this minimization
problem is computationally hard (see (Ben-David and Simon, 2000)). 

The SVM paradigm deals with this hardness by replacing the margin error rate with a
convex surrogate loss, in particular, the hinge loss $\ell_{\mathrm{hinge}}(x) = (1 - x)_+$. For simplicity, let
us also define the 0-1 loss function, $\ell_{0-1}(x) = 1(x \le 0)$, and the $\gamma$-margin loss function,
$\ell_\gamma(x) = 1(x \le \gamma)$. For any loss function $\ell$, define $\mathrm{Err}_{\mathcal{D},\ell}(f) = E_{(x,y)\sim\mathcal{D}}\,\ell(yf(x))$. We also
use shorthands such as $\mathrm{Err}_{\mathcal{D},\mathrm{hinge}}$ instead of $\mathrm{Err}_{\mathcal{D},\ell_{\mathrm{hinge}}}$. Note that for $x \in [-2, 2]$,

$$\ell_{0-1}(x) \le \ell_{\mathrm{hinge}}(x/\gamma) \le (1 + 2/\gamma)\,\ell_\gamma(x),$$

² A proper learner must output a halfspace classifier. Here we consider improper learning where the learner
can output any classifier.

from which it easily follows that a solution to the following problem

$$\min_{w,b}\ \mathrm{Err}_{\mathcal{D},\mathrm{hinge}}\left(\tfrac{1}{\gamma}\Lambda_{w,b}\right) \quad \text{s.t. } w \in H,\ b \in \mathbb{R},\ \|w\|_H \le 1$$

gives an approximation ratio of $\alpha(\gamma) = 1 + 2/\gamma$. It is more convenient to use the problem

$$\min_{w,b}\ \mathrm{Err}_{\mathcal{D},\mathrm{hinge}}(\Lambda_{w,b}) \quad \text{s.t. } w \in H,\ b \in \mathbb{R},\ \|w\|_H \le C, \qquad (1)$$

which is equivalent for $C = 1/\gamma$. Problem (1) can be approximated, up to an additive error of
$\epsilon$, using $\mathrm{poly}(1/\gamma, 1/\epsilon)$ samples and time.
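
The sandwich inequality between the 0-1 loss, the rescaled hinge loss and the $\gamma$-margin loss is what yields the ratio $1 + 2/\gamma$. The sketch below checks it numerically on a grid of $[-2, 2]$; the grid resolution, the tolerance and the convention that both indicator losses use a non-strict inequality are assumptions of the example.

```python
def hinge(x):
    return max(0.0, 1.0 - x)

def zero_one(x):
    return 1.0 if x <= 0 else 0.0

def margin_loss(x, gamma):
    return 1.0 if x <= gamma else 0.0

def sandwich_holds(gamma, steps=4000):
    """Check l_{0-1}(x) <= hinge(x / gamma) <= (1 + 2 / gamma) * l_gamma(x) on a grid of [-2, 2]."""
    tol = 1e-9
    for i in range(steps + 1):
        x = -2.0 + 4.0 * i / steps
        lower = zero_one(x)
        middle = hinge(x / gamma)
        upper = (1.0 + 2.0 / gamma) * margin_loss(x, gamma)
        if not (lower <= middle + tol and middle <= upper + tol):
            return False
    return True

results = {g: sandwich_holds(g) for g in (0.5, 0.1, 0.01)}
print(results)
```

The upper bound is attained at $x = -2$, which is why the restriction to $[-2, 2]$ matters.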

Kernel-free soft SVM means solving Problem (1). Kernel SVM has the additional freedom
of embedding $B$ into the unit ball of another Hilbert space, thus extending the repertoire of
possible output functions. Concretely, let $\psi : B \to B_1$, where $B_1$ is the unit ball of a Hilbert
space $H_1$. The embedding $\psi$ need not be computed directly. Rather, it is enough that we
can compute the corresponding kernel, $k(x, y) := \langle \psi(x), \psi(y) \rangle_{H_1}$. It remains to solve the
following program

$$\min_{w,b}\ \mathrm{Err}_{\mathcal{D},\mathrm{hinge}}(\Lambda_{w,b} \circ \psi) \quad \text{s.t. } w \in H_1,\ b \in \mathbb{R},\ \|w\|_{H_1} \le C. \qquad (2)$$

This problem can be approximated, up to an additive error of $\epsilon$, using $\mathrm{poly}(C/\epsilon)$ samples
and time as follows. Let $(x_1, y_1), \ldots, (x_n, y_n)$ be a sequence of i.i.d. samples from $\mathcal{D}$. Let $\hat{\mathcal{D}}$
be the empirical distribution over these examples. By a uniform convergence argument (e.g.
(Boucheron et al., 2005)), if $n = \Omega(C^2/\epsilon^2)$, then w.h.p. over the choice of examples we have

$$\max_{b \in \mathbb{R},\, w \in H_1,\, \|w\|_{H_1} \le C} \left| \mathrm{Err}_{\mathcal{D},\mathrm{hinge}}(\Lambda_{w,b} \circ \psi) - \mathrm{Err}_{\hat{\mathcal{D}},\mathrm{hinge}}(\Lambda_{w,b} \circ \psi) \right| \le \epsilon/2.$$

Therefore, we can solve (2) w.r.t. $\hat{\mathcal{D}}$ instead of w.r.t. $\mathcal{D}$. It is easy to verify that there
exists a solution of the form $w = \sum_{i=1}^n \alpha_i \psi(x_i)$ to the problem w.r.t. $\hat{\mathcal{D}}$ for some real $\alpha_i$'s.
Therefore, we can optimize over $\alpha \in \mathbb{R}^n$ instead of over $w \in H_1$, which yields the problem

$$\min_{\alpha \in \mathbb{R}^n,\, b \in \mathbb{R}}\ \frac{1}{n} \sum_{j=1}^n \ell_{\mathrm{hinge}}\left( y_j \left( \sum_{i=1}^n \alpha_i k(x_i, x_j) + b \right) \right) \quad \text{s.t. } \sum_{i,j=1}^n \alpha_i \alpha_j k(x_i, x_j) \le C^2.$$

In this formulation we access examples only via the kernel function (a.k.a. the "kernel trick").
This is a convex optimization problem in $n + 1$ variables, which can be solved in time $\mathrm{poly}(n)$
by standard methods (assuming that the kernel function can be evaluated efficiently). Recall
that $n$ should be $\Omega(C^2/\epsilon^2)$ for the uniform convergence to hold. It therefore makes sense,
and we will adopt this practice, to refer to $C$ as the time and sample complexity of program
(2). In particular, we will prove lower bounds for any approximate solution of program (2)
assuming that $C = \mathrm{poly}(1/\gamma)$.
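
To make the kernelized program concrete, here is a minimal sketch that minimizes the empirical hinge loss over $(\alpha, b)$ subject to $\sum_{i,j} \alpha_i \alpha_j k(x_i, x_j) \le C^2$. The text only says the problem can be solved "by standard methods"; the projected subgradient scheme, the Gaussian kernel, the step sizes and the toy data below are all assumptions chosen for illustration, not the method of the paper.

```python
import math

def rbf_kernel(x, z, width=1.0):
    """Gaussian kernel (an assumed choice; any efficiently computable kernel works)."""
    return math.exp(-sum((a - c) ** 2 for a, c in zip(x, z)) / (2 * width ** 2))

def gram_matrix(xs, kernel):
    return [[kernel(xi, xj) for xj in xs] for xi in xs]

def predictions(K, alpha, b):
    n = len(alpha)
    return [sum(alpha[i] * K[i][j] for i in range(n)) + b for j in range(n)]

def hinge_objective(K, ys, alpha, b):
    preds = predictions(K, alpha, b)
    return sum(max(0.0, 1.0 - y * p) for y, p in zip(ys, preds)) / len(ys)

def squared_norm(K, alpha):
    """Squared norm of w = sum_i alpha_i psi(x_i), computed via the kernel only."""
    n = len(alpha)
    return sum(alpha[i] * alpha[j] * K[i][j] for i in range(n) for j in range(n))

def kernel_svm(K, ys, C, steps=500, lr=0.5):
    """Projected subgradient descent in the feature space; returns the best feasible iterate."""
    n = len(ys)
    alpha, b = [0.0] * n, 0.0
    best = (hinge_objective(K, ys, alpha, b), alpha[:], b)
    for t in range(1, steps + 1):
        eta = lr / math.sqrt(t)
        preds = predictions(K, alpha, b)
        active = [j for j in range(n) if ys[j] * preds[j] < 1.0]
        # A subgradient of the hinge term w.r.t. w is -(1/n) * sum of y_j psi(x_j) over active j.
        for j in active:
            alpha[j] += eta * ys[j] / n
        b += eta * sum(ys[j] for j in active) / n
        # Projection onto the ball ||w|| <= C amounts to rescaling alpha.
        norm = math.sqrt(max(squared_norm(K, alpha), 0.0))
        if norm > C:
            alpha = [a * C / norm for a in alpha]
        value = hinge_objective(K, ys, alpha, b)
        if value < best[0]:
            best = (value, alpha[:], b)
    return best

# Hypothetical toy data in the unit ball of R^2.
xs = [(0.6, 0.2), (0.4, -0.1), (-0.5, 0.3), (-0.7, -0.2)]
ys = [1, 1, -1, -1]
K = gram_matrix(xs, rbf_kernel)
value, alpha, b = kernel_svm(K, ys, C=5.0)
print(value, squared_norm(K, alpha))
```

Note that the data enter only through the Gram matrix `K`, which is the point of the kernel trick; the embedding $\psi$ is never evaluated.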

We formulate our results for Problem (2), but they apply as well to the following commonly
used formulation of the kernel SVM problem, where the constraint $\|w\|_{H_1} \le C$ is
replaced by a regularization term. Namely

$$\min_{w \in H_1,\, b \in \mathbb{R}}\ \frac{1}{C^2}\|w\|_{H_1}^2 + \mathrm{Err}_{\mathcal{D},\mathrm{hinge}}(\Lambda_{w,b} \circ \psi). \qquad (3)$$

The optimum of program (3) is $\le 1$, as the zero solution $w = 0$, $b = 0$ shows. Thus, if $w, b$
is an approximate optimal solution, then $\frac{1}{C^2}\|w\|_{H_1}^2 \le O(1) \Rightarrow \|w\|_{H_1} \le O(C)$. This observation
makes it easy to modify our results on program (2) to apply to program (3).
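
Spelled out, the observation is the following short chain. It assumes that program (3) carries the regularization weight $1/C^2$ and that "approximate" means optimal up to an additive $\epsilon$; both readings are assumptions where the source is not explicit.

```latex
\begin{align*}
\frac{1}{C^2}\|w\|_{H_1}^2
  &\le \frac{1}{C^2}\|w\|_{H_1}^2 + \mathrm{Err}_{\mathcal{D},\mathrm{hinge}}(\Lambda_{w,b}\circ\psi)
     && \text{(the hinge loss is nonnegative)} \\
  &\le \mathrm{OPT}_{(3)} + \epsilon
     && \text{($(w,b)$ is $\epsilon$-optimal)} \\
  &\le \bigl(0 + \ell_{\mathrm{hinge}}(0)\bigr) + \epsilon = 1 + \epsilon
     && \text{(compare with $w = 0$, $b = 0$)},
\end{align*}
so that $\|w\|_{H_1} \le C\sqrt{1+\epsilon} = O(C)$.
```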

1.3 Non Kernel-based Feature Mappings 

The kernel trick constrains the $L_2$-norm of $w \in H_1$. Other algorithms require instead that $w$
belongs to some set $W \subset \mathbb{R}^m$. For example, the Lasso method (Tibshirani, 1996) constrains
$w$ to have a low $L_1$-norm. We take the dimension, $m$, of $H_1$ as a measure of the complexity of
the problem, and require it to be polynomial in $1/\gamma$. This choice is justified, as without the
$L_2$-norm constraint, we cannot rely on the "kernel-trick" and in general must work directly
in $H_1$. Therefore, every algorithm must have time complexity $\Omega(m)$. We prove lower bounds
for any approximate solution to a problem of the form

$$\min_{w,b}\ \mathrm{Err}_{\mathcal{D},\ell}(\Lambda_{w,b} \circ \psi) \quad \text{s.t. } w \in W \subset \mathbb{R}^m,\ b \in \mathbb{R}, \qquad (4)$$

where $\ell$ is some surrogate loss function (see formal definition in the next section).

Note that any finite dimensional space of functions over the ball that includes the constant
functions is of the form $\{\Lambda_{w,b} \circ \psi\}$ for some $\psi$. Hence, our lower bounds hold for
any method that optimizes a surrogate loss over a subset of a finite dimensional space of
functions that includes the constant functions.

2 Results 

We first define the two families of algorithms to which our lower bounds apply. We start 
with the class of surrogate loss functions. This class includes the most popular choices such 
as the absolute loss $|1 - x|$, the squared loss $(1 - x)^2$, the logistic loss $\log_2(1 + e^{-x})$, etc.

Definition 2.1 (Surrogate loss function) A function $\ell : \mathbb{R} \to \mathbb{R}$ is called a surrogate loss
function if $\ell$ is convex and is lower bounded by the 0-1 loss.
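
As a sanity check on this definition, the sketch below verifies numerically that the hinge, absolute, squared and logistic losses dominate the 0-1 loss and satisfy midpoint convexity on a grid. A grid check is only a necessary condition, not a proof; the grid, the tolerances and the convention $\ell_{0-1}(x) = 1(x \le 0)$ are assumptions of the example.

```python
import math

def zero_one(x):
    return 1.0 if x <= 0 else 0.0

LOSSES = {
    "hinge": lambda x: max(0.0, 1.0 - x),
    "absolute": lambda x: abs(1.0 - x),
    "squared": lambda x: (1.0 - x) ** 2,
    "logistic": lambda x: math.log2(1.0 + math.exp(-x)),
}

def dominates_zero_one(loss, grid):
    """Lower bounded by the 0-1 loss at every grid point."""
    return all(loss(x) >= zero_one(x) - 1e-12 for x in grid)

def midpoint_convex(loss, grid):
    """Necessary condition for convexity: loss((x + z) / 2) <= (loss(x) + loss(z)) / 2."""
    return all(loss((x + z) / 2) <= (loss(x) + loss(z)) / 2 + 1e-9
               for x in grid for z in grid)

grid = [i / 10 for i in range(-50, 51)]
report = {name: (dominates_zero_one(f, grid), midpoint_convex(f, grid))
          for name, f in LOSSES.items()}
print(report)
```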

The first family of algorithms contains kernel based algorithms, such as kernel SVM. 

Definition 2.2 (Kernel based learner) Let $\ell : \mathbb{R} \to \mathbb{R}$ be a surrogate loss function. A
kernel based learning algorithm, $A$, receives as input $\gamma \in (0, 1)$. It then selects $C = C_A(\gamma)$
and an absolutely continuous feature mapping, $\psi = \psi_A(\gamma)$. The algorithm returns a function

$$A(\gamma) \in \{\Lambda_{w,b} \circ \psi : w \in H_1,\ b \in \mathbb{R},\ \|w\|_{H_1} \le C\}$$

such that, with probability $\ge 1 - \exp(-1/\gamma)$,

$$\mathrm{Err}_{\mathcal{D},\ell}(A(\gamma)) \le \inf\{\mathrm{Err}_{\mathcal{D},\ell}(\Lambda_{w,b} \circ \psi) : w \in H_1,\ b \in \mathbb{R},\ \|w\|_{H_1} \le C\} + \sqrt{\gamma}.$$

We say that $A$ is efficient if $C_A(\gamma) \le \mathrm{poly}(1/\gamma)$.

Note that the definition of kernel based learner allows for any predefined convex surrogate
loss, not just the hinge loss. Namely, we consider the program

$$\min_{w,b}\ \mathrm{Err}_{\mathcal{D},\ell}(\Lambda_{w,b} \circ \psi) \quad \text{s.t. } w \in H_1,\ b \in \mathbb{R},\ \|w\|_{H_1} \le C. \qquad (5)$$

Uniform convergence results (e.g. (Boucheron et al., 2005)) suggest that for a Lipschitz
surrogate, program (5) can be approximated, up to an additive error of $\epsilon$ and with probability
$\ge 1 - \exp(-1/\gamma)$, from a sample of size³ $\mathrm{poly}(C, 1/\gamma, 1/\epsilon)$, hence our definition of efficiency here.
We note that our results also hold if the kernel corresponding to $\psi$ is hard to compute.

The second family of learning algorithms involves an arbitrary feature mapping and
domain constraint on the vector $w$, as in program (4).

Definition 2.3 (Generalized linear learner) Let $\ell : \mathbb{R} \to \mathbb{R}$ be some surrogate loss function.
A generalized linear learning algorithm, $A$, receives as input $\gamma \in (0, 1)$. It then selects
a continuous embedding $\psi = \psi_A(\gamma) : B \to \mathbb{R}^m$ and a constraint set $W = W_A(\gamma) \subseteq \mathbb{R}^m$. The
algorithm returns, with probability $\ge 1 - \exp(-1/\gamma)$, a function

$$A(\gamma) \in \{\Lambda_{w,b} \circ \psi : w \in W,\ b \in \mathbb{R}\}$$

such that

$$\mathrm{Err}_{\mathcal{D},\ell}(A(\gamma)) \le \inf\{\mathrm{Err}_{\mathcal{D},\ell}(\Lambda_{w,b} \circ \psi) : w \in W,\ b \in \mathbb{R}\} + \sqrt{\gamma}.$$

We say that $A$ is efficient if $m = m_A(\gamma) \le \mathrm{poly}(1/\gamma)$.



2.1 Main Results 

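Theorem 2.4 Let $A$ be an efficient kernel-based learner. Then, for every $\gamma > 0$, there exists
a distribution $\mathcal{D}$ on $B$ such that, w.p. $\ge 1 - \exp(-1/\gamma)$,

$$\frac{\mathrm{Err}_{\mathcal{D},0-1}(A(\gamma))}{\mathrm{Err}_\gamma(\mathcal{D})} \ge \Omega\left(\frac{1}{\gamma \cdot \mathrm{poly}(\log(1/\gamma))}\right).$$

In fact, we will prove a stronger result which holds also if $C_A(\gamma)$ is not polynomial.

Theorem 2.5 Let $A$ be a kernel-based learner for which $C_A(\gamma) = \exp(o(\gamma^{-2/7}))$. Then, for
every $\gamma > 0$, there exists a distribution $\mathcal{D}$ on $B$ such that, w.p. $\ge 1 - \exp(-1/\gamma)$,

$$\frac{\mathrm{Err}_{\mathcal{D},0-1}(A(\gamma))}{\mathrm{Err}_\gamma(\mathcal{D})} \ge \Omega\left(\frac{1}{\gamma \cdot \mathrm{poly}(\log(C_A(\gamma)))}\right).$$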

It is shown in (Birnbaum and Shalev-Shwartz, 2012) that solving kernel SVM with a
specific kernel (i.e. a specific $\psi$) yields an approximation ratio of $O\left(\frac{1}{\gamma\sqrt{\log(1/\gamma)}}\right)$. It follows
that our lower bound in Theorem 2.5 is essentially tight. Also, this theorem can be viewed
as a substantial generalization of (Ben-David et al., 2012, Long and Servedio, 2011), who
give an approximation ratio of $\Omega(1/\gamma)$ with no embedding (i.e., $\psi$ is the identity map). Also
relevant is (Shalev-Shwartz et al., 2011), which shows that for a certain $\psi$, and $C_A(\gamma) =
\mathrm{poly}(\exp((1/\gamma) \cdot \log(1/\gamma)))$, kernel SVM has approximation ratio of 1. Theorem 2.5 shows
that, for some $a > 0$, for kernel SVM to achieve a constant approximation ratio, $C_A$ must
be $\Omega(\exp((1/\gamma)^a))$.

³ Concretely, the needed sample size depends polynomially on $C \cdot L$, for $L$ the Lipschitz constant of $\ell$. Since $\ell$ is assumed
to be predefined, we neglect the dependency on $L$.

Next, we lower bound the approximation ratio of every efficient generalized linear learner. 



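Theorem 2.6 Let $A$ be a generalized linear learner corresponding to a Lipschitz surrogate.
Assume that $m_A(\gamma) = \exp(o(\gamma^{-1/8}))$. Then, for every $\gamma > 0$, there exists a distribution $\mathcal{D}$
on $S^{d-1} \times \{\pm 1\}$ with $d = O(\log(m_A(\gamma)/\gamma))$ such that, w.p. $\ge 1 - \exp(-1/\gamma)$,

$$\frac{\mathrm{Err}_{\mathcal{D},0-1}(A(\gamma))}{\mathrm{Err}_\gamma(\mathcal{D})} \ge \Omega\left(\frac{1}{\sqrt{\gamma} \cdot \mathrm{poly}(\log(m_A(\gamma)/\gamma))}\right).$$

Corollary 2.7 Let $A$ be an efficient generalized linear learner corresponding to a Lipschitz
surrogate. Then, for every $\gamma > 0$, there exists a distribution $\mathcal{D}$ on $S^{d-1} \times \{\pm 1\}$ with $d =
O(\log(1/\gamma))$ such that, w.p. $\ge 1 - \exp(-1/\gamma)$,

$$\frac{\mathrm{Err}_{\mathcal{D},0-1}(A(\gamma))}{\mathrm{Err}_\gamma(\mathcal{D})} \ge \Omega\left(\frac{1}{\sqrt{\gamma} \cdot \mathrm{poly}(\log(1/\gamma))}\right).$$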

2.2 Main proof ideas 

To give the reader some idea of our arguments, we sketch some of the main ingredients of the 
proof of Theorem 2.5. At the end of this section we sketch the idea of the proof of Theorem 
2.6. We note, however, that the actual proofs are organized somewhat differently. 

We will construct a distribution $\mathcal{D}$ over $S^{d-1} \times \{\pm 1\}$ (recall that $\mathbb{R}^d$ is viewed as standardly
embedded in $H = \ell_2$). Thus, we can assume that the program is formulated in terms of the
unit sphere, $S^\infty \subset \ell_2$, and not the unit ball.

Fix an embedding $\psi$ and $C > 0$. Denote by $k : S^\infty \times S^\infty \to \mathbb{R}$ the corresponding kernel
$k(x, y) = \langle \psi(x), \psi(y) \rangle_{H_1}$ and consider the following set of functions over $S^\infty$

$$H_k = \{\Lambda_{v,0} \circ \psi : v \in H_1\}.$$

$H_k$ is a Hilbert space with norm $\|f\|_{H_k} = \inf\{\|v\|_{H_1} : \Lambda_{v,0} \circ \psi = f\}$. The subscript $k$
indicates that $H_k$ is uniquely determined (as a Hilbert space) given the kernel $k$. With this
interpretation, program (5) is equivalent to the program

$$\min_{f \in H_k,\, b \in \mathbb{R}}\ \mathrm{Err}_{\mathcal{D},\ell}(f + b) \quad \text{s.t. } \|f\|_{H_k} \le C. \qquad (6)$$

For simplicity we focus on $\ell$ being the hinge-loss (the generalization to other surrogate loss
functions is rather technical).

The proof may be split into three steps: 

1. We consider the one-dimensional problem of improperly learning halfspaces (i.e.
thresholds on the line) by optimizing the hinge loss over the space of univariate polynomials
of degree bounded by $\log(C)$. We construct a distribution $\mathcal{D}$ over $[-1, 1] \times \{\pm 1\}$
that is a convex combination of two distributions: one that is separable by a $\gamma$-margin
halfspace and the other representing a tiny amount of noise. We show that each solution
of the problem of minimizing the hinge-loss w.r.t. $\mathcal{D}$ over the space of such
polynomials has the property that $f(\gamma) \approx f(-\gamma)$.

2. We pull back the distribution $\mathcal{D}$ w.r.t. a direction $e \in S^{d-1}$ to a distribution over
$S^{d-1} \times \{\pm 1\}$. Let $f$ be an approximate solution of program (6). We show that $f$ takes
almost the same value on instances for which $\langle x, e \rangle = \gamma$ and $\langle x, e \rangle = -\gamma$. This step
can be further broken into three substeps:

(a) First, we assume that the kernel is symmetric and $f(x)$ depends only on $\langle x, e \rangle$.
This substep uses a characterization of Hilbert spaces corresponding to symmetric
kernels, from which it follows that $f$ has the form

$$f(x) = \sum_{n=0}^{\infty} \alpha_n P_{d,n}(\langle x, e \rangle).$$

Here $P_{d,n}$ are the $d$-dimensional Legendre polynomials and $\sum_{n=0}^{\infty} \alpha_n^2 \le C^2$. This
allows us to rely on the results for the one-dimensional case from step (1).

(b) By symmetrizing $f$, we relax the assumption that $f$ depends only on $\langle x, e \rangle$.

(c) By averaging the kernel over the group of linear isometries on $\mathbb{R}^d$, we relax the
assumption that the kernel is symmetric.

3. Finally, we show that for the distribution from the previous step, if $f$ is an approximate
solution to program (6) then $f$ predicts the same value, 1, on instances for which
$\langle x, e \rangle = \gamma$ and $\langle x, e \rangle = -\gamma$. This establishes our claim, as the constructed distribution
assigns the value $-1$ to instances for which $\langle x, e \rangle = -\gamma$.

We now expand on this brief description of the main steps. 
The one dimensional distribution 

We define a distribution $\mathcal{D}$ on $[-1, 1] \times \{\pm 1\}$ as follows. Start with the distribution $\mathcal{D}_1$ that takes the
values $\pm(\gamma, 1)$, where $\mathcal{D}_1(\gamma, 1) = 0.7$ and $\mathcal{D}_1(-\gamma, -1) = 0.3$. Clearly, for this distribution,
the threshold has zero error rate. To construct $\mathcal{D}$, we perturb $\mathcal{D}_1$ with "noise" as follows.
Let $\mathcal{D} = (1 - \lambda)\mathcal{D}_1 + \lambda\mathcal{D}_2$, where $\mathcal{D}_2$ is defined as follows. The probability of the labels is
uniform and independent of the instance and the marginal probability over the instances is
defined by the density function



[Display equation defining the density $\rho$; not recoverable from the source.]

This choice of $\rho$ simplifies our calculations due to its relation to Chebyshev polynomials.
However, other choices of $\rho$ which are supported on a small interval around zero can also
work.
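
The following sketch samples from a mixture of this shape and estimates the error of the sign threshold, which errs only on the noise component and only half of the time there. The density used for the noise instances (an arcsine law on $[-1/8, 1/8]$, the weight associated with Chebyshev polynomials) is an assumption standing in for $\rho$, as are the values of $\gamma$, $\lambda$ and the sample size.

```python
import math
import random

def sample_d1(gamma, rng):
    """Separable part: (gamma, +1) w.p. 0.7 and (-gamma, -1) w.p. 0.3."""
    return (gamma, 1) if rng.random() < 0.7 else (-gamma, -1)

def sample_d2(rng, radius=0.125):
    """Noise part: uniform label, instance from an ASSUMED arcsine density on [-radius, radius]."""
    x = radius * math.cos(math.pi * rng.random())
    y = 1 if rng.random() < 0.5 else -1
    return (x, y)

def sample_d(gamma, lam, rng):
    """Mixture (1 - lam) * D1 + lam * D2."""
    return sample_d2(rng) if rng.random() < lam else sample_d1(gamma, rng)

rng = random.Random(0)
gamma, lam, n = 0.01, 0.2, 200_000
sample = [sample_d(gamma, lam, rng) for _ in range(n)]

# 0-1 error of the threshold classifier x -> sign(x).
threshold_error = sum((1 if x > 0 else -1) != y for x, y in sample) / n
print(threshold_error)
```

Any other noise density concentrated near zero would give the same threshold error, since the labels of the noise component are independent of the instance.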

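Note that the error rate of the threshold on $\mathcal{D}$ is $\lambda/2$. We next show that each polynomial
$f$ of degree $K = \log(C)$ that satisfies $\mathrm{Err}_{\mathcal{D},\mathrm{hinge}}(f) \le 1$ must have $f(\gamma) \approx f(-\gamma)$.
Indeed, if

$$1 \ge \mathrm{Err}_{\mathcal{D},\mathrm{hinge}}(f) = (1 - \lambda)\,\mathrm{Err}_{\mathcal{D}_1,\mathrm{hinge}}(f) + \lambda\,\mathrm{Err}_{\mathcal{D}_2,\mathrm{hinge}}(f)$$

then $\mathrm{Err}_{\mathcal{D}_2,\mathrm{hinge}}(f) \le \frac{1}{\lambda}$. But,

$$\mathrm{Err}_{\mathcal{D}_2,\mathrm{hinge}}(f) = \frac{1}{2}\int_{-1}^{1} \ell_{\mathrm{hinge}}(f(x))\rho(x)\,dx + \frac{1}{2}\int_{-1}^{1} \ell_{\mathrm{hinge}}(-f(x))\rho(x)\,dx \ge \frac{1}{2}\int_{-1}^{1} \ell_{\mathrm{hinge}}(-|f(x)|)\rho(x)\,dx$$

and using the convexity of $\ell_{\mathrm{hinge}}$ we obtain from Jensen's inequality that

$$\ge \frac{1}{2}\,\ell_{\mathrm{hinge}}\left(\int_{-1}^{1} -|f(x)|\rho(x)\,dx\right) = \frac{1}{2}\left(1 + \int_{-1}^{1} |f(x)|\rho(x)\,dx\right) \ge \frac{1}{2}\int_{-1}^{1} |f(x)|\rho(x)\,dx =: \frac{1}{2}\|f\|_{1,d\rho}.$$

This shows that $\|f\|_{1,d\rho} \le \frac{2}{\lambda}$. We next write $f = \sum_{i=1}^{K} \alpha_i \tilde{T}_i$, where $\{\tilde{T}_i\}$ are the orthonormal
polynomials corresponding to the measure $d\rho$. Since $\tilde{T}_i$ are related to Chebyshev polynomials
we can uniformly bound their norm, hence obtain that

$$\sqrt{\sum_i \alpha_i^2} = \|f\|_{2,d\rho} \le O(\sqrt{K})\,\|f\|_{1,d\rho} \le O\left(\frac{\sqrt{K}}{\lambda}\right).$$

Based on the above, and using a bound on the derivatives of Chebyshev polynomials, we can
bound the derivative of the polynomial $f$

$$|f'(x)| \le \sum_i |\alpha_i|\,|\tilde{T}_i'(x)| \le O\left(\frac{K^3}{\lambda}\right).$$

Hence, by choosing $\lambda = \omega(\gamma K^3) = \omega(\gamma \log^3(C))$ we obtain

$$|f(\gamma) - f(-\gamma)| \le 2\gamma \max_x |f'(x)| = O\left(\frac{\gamma K^3}{\lambda}\right) = o(1),$$

as required.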

Pulling back to the $d - 1$ dimensional sphere

Given the distribution $\mathcal{D}$ over $[-1, 1] \times \{\pm 1\}$ described before, and some $e \in S^{d-1}$, we now
define a distribution $\mathcal{D}_e$ on $S^{d-1} \times \{\pm 1\}$. To sample from $\mathcal{D}_e$, we first sample $(\alpha, \beta)$ from $\mathcal{D}$
and (uniformly and independently) a vector $z$ from the 1-codimensional sphere of $S^{d-1}$ that
is orthogonal to $e$. The constructed point is $(\alpha e + \sqrt{1 - \alpha^2}\, z, \beta)$.
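
The pull-back can be sketched as follows: given a one-dimensional example and a direction $e$, draw a uniform unit vector orthogonal to $e$ and combine the two. Drawing the orthogonal vector by projecting a standard Gaussian and normalizing is an implementation choice of this example, as are the dimension and the sample values.

```python
import math
import random

def random_unit_orthogonal(e, rng):
    """Uniform unit vector orthogonal to the unit vector e (Gaussian sample, project, normalize)."""
    while True:
        g = [rng.gauss(0.0, 1.0) for _ in e]
        dot = sum(gi * ei for gi, ei in zip(g, e))
        z = [gi - dot * ei for gi, ei in zip(g, e)]
        norm = math.sqrt(sum(zi * zi for zi in z))
        if norm > 1e-12:
            return [zi / norm for zi in z]

def pull_back(a, beta, e, rng):
    """Map a one-dimensional example (a, beta) to (a * e + sqrt(1 - a^2) * z, beta)."""
    z = random_unit_orthogonal(e, rng)
    s = math.sqrt(max(0.0, 1.0 - a * a))
    return [a * ei + s * zi for ei, zi in zip(e, z)], beta

rng = random.Random(1)
d = 6
e = [1.0] + [0.0] * (d - 1)
x, label = pull_back(0.05, -1, e, rng)
inner = sum(xi * ei for xi, ei in zip(x, e))
norm = math.sqrt(sum(xi * xi for xi in x))
print(inner, norm, label)
```

By construction the point lies on the unit sphere and its inner product with $e$ equals the one-dimensional instance, so the margin with respect to the hyperplane orthogonal to $e$ is preserved.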

For any $f \in H_k$ and $a \in [-1, 1]$ define $\bar{f}(a)$ to be the expectation of $f$ over the
1-codimensional sphere $\{x \in S^{d-1} : \langle x, e \rangle = a\}$. We will show that for any $f \in H_k$ such that
$\|f\|_{H_k} \le C$ and $\mathrm{Err}_{\mathcal{D}_e,\mathrm{hinge}}(f) \le 1$, we have that $|\bar{f}(\gamma) - \bar{f}(-\gamma)| = o(1)$.

To do so, let us first assume that $f$ is symmetric with respect to $e$, and hence can be
written as

$$f(x) = \sum_{n=0}^{\infty} \alpha_n P_{d,n}(\langle x, e \rangle),$$

where $\alpha_n \in \mathbb{R}$ and $P_{d,n}$ is the $d$-dimensional Legendre polynomial of degree $n$. Furthermore,
by a characterization of Hilbert spaces corresponding to symmetric kernels, it follows that
$\sum_{n=0}^{\infty} \alpha_n^2 \le C^2$. Since $f$ is symmetric w.r.t. $e$ we have,

$$\bar{f}(a) = \sum_{n=0}^{\infty} \alpha_n P_{d,n}(a).$$

For $|a| \le 1/8$, we have that $|P_{d,n}(a)|$ tends to zero exponentially fast with both $d$ and $n$.
Hence, if $d$ is large enough then

$$\bar{f}(a) \approx \sum_{n=0}^{\log(C)} \alpha_n P_{d,n}(a) =: \tilde{f}(a).$$

Note that $\tilde{f}$ is a polynomial of degree bounded by $\log(C)$. In addition, by construction,
$\mathrm{Err}_{\mathcal{D}_e,\mathrm{hinge}}(f) = \mathrm{Err}_{\mathcal{D},\mathrm{hinge}}(\bar{f}) \approx \mathrm{Err}_{\mathcal{D},\mathrm{hinge}}(\tilde{f})$. Hence, if $1 \ge \mathrm{Err}_{\mathcal{D}_e,\mathrm{hinge}}(f)$ then using the
previous subsection we conclude that $|\bar{f}(\gamma) - \bar{f}(-\gamma)| = o(1)$.

Symmetrization of / 

In the above, we assumed both that the kernel function is symmetric and that $f$ is symmetric
w.r.t. $e$. Our next step is to relax the latter assumption, while still assuming that the kernel
function is symmetric.

Let $O(e)$ be the group of linear isometries that fix $e$, namely, $O(e) = \{A \in O(d) : Ae = e\}$.
By assuming that $k$ is a symmetric kernel, we have that for all $A \in O(e)$, the function $g(x) =
f(Ax)$ is also in $H_k$. Furthermore, $\|g\|_{H_k} = \|f\|_{H_k}$ and by the construction of $\mathcal{D}_e$ we also have
$\mathrm{Err}_{\mathcal{D}_e,\mathrm{hinge}}(g) = \mathrm{Err}_{\mathcal{D}_e,\mathrm{hinge}}(f)$. Let $\mathcal{P}_e f(x) = \int_{O(e)} f(Ax)\,dA$ be the symmetrization of $f$ w.r.t.
$e$. On one hand, $\mathcal{P}_e f \in H_k$, $\|\mathcal{P}_e f\|_{H_k} \le \|f\|_{H_k}$, and $\mathrm{Err}_{\mathcal{D}_e,\mathrm{hinge}}(\mathcal{P}_e f) \le \mathrm{Err}_{\mathcal{D}_e,\mathrm{hinge}}(f)$. On the
other hand, $\bar{f} = \overline{\mathcal{P}_e f}$. Since for $\mathcal{P}_e f$ we have already shown that $|\overline{\mathcal{P}_e f}(\gamma) - \overline{\mathcal{P}_e f}(-\gamma)| = o(1)$,
it follows that $|\bar{f}(\gamma) - \bar{f}(-\gamma)| = o(1)$ as well.






Symmetrization of the kernel 

Our final step is to remove the assumption that the kernel is symmetric. To do so, we first
symmetrize the kernel as follows. Recall that $O(d)$ is the group of linear isometries of $\mathbb{R}^d$.
Define the following symmetric kernel:

$$k_s(x, y) = \int_{O(d)} k(Ax, Ay)\,dA.$$

We show that the corresponding Hilbert space consists of functions of the form

$$f(x) = \int_{O(d)} f_A(Ax)\,dA,$$

where for every $A$, $f_A \in H_k$. Moreover,

$$\|f\|_{H_{k_s}} \le \int_{O(d)} \|f_A\|_{H_k}\,dA. \qquad (7)$$

Let $\alpha$ be the maximal number such that

$$\forall e \in S^{d-1}\ \exists f_e \in H_k \text{ s.t. } \|f_e\|_{H_k} \le C,\ \mathrm{Err}_{\mathcal{D}_e,\mathrm{hinge}}(f_e) \le 1,\ |\bar{f}_e(\gamma) - \bar{f}_e(-\gamma)| \ge \alpha.$$

Since $H_k$ is closed under negation, it follows that $\alpha$ satisfies

$$\forall e \in S^{d-1}\ \exists f_e \in H_k \text{ s.t. } \|f_e\|_{H_k} \le C,\ \mathrm{Err}_{\mathcal{D}_e,\mathrm{hinge}}(f_e) \le 1,\ \bar{f}_e(\gamma) - \bar{f}_e(-\gamma) \ge \alpha.$$

Fix some $v \in S^{d-1}$ and define $f \in H_{k_s}$ to be

$$f(x) = \int_{O(d)} f_{Av}(Ax)\,dA.$$

By Equation (7) we have that $\|f\|_{H_{k_s}} \le C$. It is also possible to show that for all $A$,
$\mathrm{Err}_{\mathcal{D}_v,\mathrm{hinge}}(f_{Av} \circ A) = \mathrm{Err}_{\mathcal{D}_{Av},\mathrm{hinge}}(f_{Av}) \le 1$. Therefore, by the convexity of the loss,
$\mathrm{Err}_{\mathcal{D}_v,\mathrm{hinge}}(f) \le 1$. It follows, by the previous sections, that $|\bar{f}(\gamma) - \bar{f}(-\gamma)| = o(1)$. But, we
show that $\bar{f}(\gamma) - \bar{f}(-\gamma) \ge \alpha$. It therefore follows that $\alpha = o(1)$, as required.



Concluding the proof 

We have shown that for every kernel, there exists some direction $e$ such that for all $f \in H_k$
that satisfy $\|f\|_{H_k} \le C$ and $\mathrm{Err}_{\mathcal{D}_e,\mathrm{hinge}}(f) \le 1$ we have that $|\bar{f}(\gamma) - \bar{f}(-\gamma)| = o(1)$.

Next, consider $f$ which is also an (approximately) optimal solution of program (2) with
respect to $\mathcal{D}_e$. Since $\mathrm{Err}_{\mathcal{D}_e,\mathrm{hinge}}(0) = 1$, we clearly have that $\mathrm{Err}_{\mathcal{D}_e,\mathrm{hinge}}(f) \le 1$, hence
$|\bar{f}(\gamma) - \bar{f}(-\gamma)| = o(1)$. Next we show that $\bar{f}(-\gamma) > 1/2$, which will imply that $f$ predicts
the label 1 for most instances on the 1-codimensional sphere such that $\langle x, e \rangle = -\gamma$. Hence,
its 0-1 error is close to $0.3(1 - \lambda) \ge 0.2$ while $\mathrm{Err}_\gamma(\mathcal{D}_e) = \lambda/2$. By choosing $\lambda = O(\gamma \log^{3.1}(C))$
we obtain that the approximation ratio is $\Omega\left(\frac{1}{\gamma \log^{3.1}(C)}\right)$.

It is therefore left to show that $\bar{f}(-\gamma) > 1/2$. Let $a = \bar{f}(\gamma) \approx \bar{f}(-\gamma)$. On a $(1 - \lambda)$
fraction of the distribution, the hinge-loss would be (on average and roughly) $0.3[1 + a]_+ +
0.7[1 - a]_+$. This function is minimized for $a = 1$, which concludes our proof since $\lambda$ is $o(1)$.
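
The last minimization is easy to check numerically. The sketch below evaluates $0.3[1 + a]_+ + 0.7[1 - a]_+$ on a grid and locates its minimizer, where $a$ stands for the (roughly common) value the solution takes at $\gamma$ and $-\gamma$; the grid range and step are arbitrary choices for the example.

```python
def pos(t):
    """Positive part [t]_+."""
    return max(0.0, t)

def average_hinge(a):
    """Average hinge loss on the separable part when f takes the common value a at +-gamma."""
    return 0.3 * pos(1.0 + a) + 0.7 * pos(1.0 - a)

grid = [i / 100 for i in range(-300, 301)]
best_a = min(grid, key=average_hinge)
print(best_a, average_hinge(best_a))
```

On $[-1, 1]$ the function equals $1 - 0.4a$ and is decreasing, while for $a > 1$ it equals $0.3(1 + a)$ and is increasing, so the minimum sits at $a = 1$ and the solution is pushed to predict $+1$ on both sides of the hyperplane.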

The proof of Theorem 2.6

To prove Theorem 2.6, we prove, using John's Lemma (Matoušek, 2002), that for every embedding
$\psi : S^{d-1} \to B_1$, we can construct a kernel $k : S^{d-1} \times S^{d-1} \to \mathbb{R}$ and a probability
measure $\mu_N$ over $S^{d-1}$ with the following properties: If $f$ is an approximate solution of program
(4), where a $\gamma$ fraction of the distribution $\mathcal{D}$ is perturbed by $\mu_N$, then $\|f\|_k \le O([\ldots])$.
Using this, we adapt the proof as sketched above to prove Theorem 2.6.



3 Additional Results 

Low dimensional distributions. It is of interest to examine Theorem 2.5 when $\mathcal{D}$ is
supported on $B^d$ for $d$ small. We show that for $d = O(\log(1/\gamma))$, the approximation ratio is
$\Omega\left(\frac{1}{\sqrt{\gamma} \cdot \mathrm{poly}(\log(1/\gamma))}\right)$. Most commonly used kernels (e.g., the polynomial, RBF, and hyperbolic
tangent kernels, as well as the kernel used in (Shalev-Shwartz et al., 2011)) are symmetric.
Namely, for all unit vectors $x, y \in B$, $k(x, y) := \langle \psi(x), \psi(y) \rangle_{H_1}$ depends only on $\langle x, y \rangle_H$.
For symmetric kernels, we show that even with the restriction that $d = O(\log(1/\gamma))$, the
approximation ratio is still $\Omega\left(\frac{1}{\gamma \cdot \mathrm{poly}(\log(1/\gamma))}\right)$. However, the result for symmetric kernels is
only proved for (idealized) algorithms that return the exact solution of program (5).

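Theorem 3.1 Let $A$ be a kernel-based learner corresponding to a Lipschitz surrogate. Assume
that $C_A(\gamma) = \exp(o(\gamma^{-1/8}))$. Then, for every $\gamma > 0$, there exists a distribution $\mathcal{D}$ on
$B^d$, for $d = O(\log(C_A(\gamma)/\gamma))$, such that, w.p. $\ge 1 - \exp(-1/\gamma)$,

$$\frac{\mathrm{Err}_{\mathcal{D},0-1}(A(\gamma))}{\mathrm{Err}_\gamma(\mathcal{D})} \ge \Omega\left(\frac{1}{\sqrt{\gamma} \cdot \mathrm{poly}(\log(C_A(\gamma)))}\right).$$

Theorem 3.2 Assume that $C = \exp(o(\gamma^{-2/7}))$ and $\psi$ is continuous and symmetric. For
every $\gamma > 0$, there exists a distribution $\mathcal{D}$ on $B^d$, for $d = O(\log(C))$, and a solution to
program (5) whose 0-1 error is $\Omega\left(\frac{1}{\gamma \cdot \mathrm{poly}(\log(C))}\right) \cdot \mathrm{Err}_\gamma(\mathcal{D})$.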

The integrality gap. In bounding the approximation ratio, we considered a predefined
loss $\ell$. We believe that similar bounds hold as well for algorithms that can choose $\ell$ according
to $\gamma$. However, at the moment, we only know how to lower bound the integrality gap, as defined
below.

If we let $\ell$ depend on $\gamma$, we should redefine the complexity of Program (5) to be $C \cdot L$,
where $L$ is the Lipschitz constant of $\ell$. (See the discussion following Program (5).) The
($\gamma$-)integrality gap of programs (5) and (4) is defined as the worst case, over all possible choices
of $\mathcal{D}$, of the ratio between the optimum of the program, running on the input $\gamma$, to $\mathrm{Err}_\gamma(\mathcal{D})$.
We note that $\mathrm{Err}_{\mathcal{D},0-1}(f) \le \mathrm{Err}_{\mathcal{D},\ell}(f)$ for every convex surrogate $\ell$. Thus, the integrality gap
always upper bounds the approximation ratio. Moreover, this fact establishes most (if not
all) guarantees for algorithms that solve Program (5) or Program (4).

We denote by $\partial_+ f$ the right derivative of the real function $f$. Note that $\partial_+ f$ always exists
for $f$ convex. Also, $\forall x \in \mathbb{R}$, $|\partial_+ f(x)| \le L$ if $f$ is $L$-Lipschitz. We prove:

Theorem 3.3 Assume that $C = \exp(o(\gamma^{-2/7}))$ and $\psi$ is continuous. For every $\gamma > 0$, there
exists a distribution $\mathcal{D}$ on $B^d$, for $d = O(\log(C))$, such that the optimum of Program (5) is
$\Omega\left(\frac{1}{\gamma \cdot \mathrm{poly}(\log(C \cdot |\partial_+ \ell(0)|))}\right) \cdot \mathrm{Err}_\gamma(\mathcal{D})$.

Thus Program (5) has integrality gap $\Omega\left(\frac{1}{\gamma \cdot \mathrm{poly}(\log(C \cdot L))}\right)$. For Program (4) we prove a similar
lower bound:

Theorem 3.4 Let $m, d, \gamma$ be such that $d = \omega(\log(m/\gamma))$ and $m = \exp(o(\gamma^{-2/7}))$. There exists
a distribution $\mathcal{D}$ on $S^{d-1} \times \{\pm 1\}$ such that the optimum of Program (4) is $\Omega\left(\frac{1}{\gamma \cdot \mathrm{poly}(\log(m/\gamma))}\right) \cdot
\mathrm{Err}_\gamma(\mathcal{D})$.



4 Conclusion 

We gave lower bounds on the worst-case approximation ratio of the family of methods in 
which one minimizes a convex surrogate loss over affine functionals in some feature space.
This applies to many popular algorithms such as kernel SVM, with any kernel, and the Lasso, 
with any feature mapping. Furthermore, our lower bound nearly matches the best known 
upper bound. Our analysis suggests that if better approximation ratios are at all possible, 
they would require methods other than optimizing a convex surrogate over a regularized 
linear class of classifiers. 

The main spirit of our results is that from a worst case perspective, Kernel-SVM is a 
rather bad algorithm. Nevertheless, it is an empirical fact that Kernel-SVM performs very 
well in practice. Therefore, it is of great interest to find distributional conditions that hold 
in practice, under which Kernel-SVM is guaranteed to perform well. We note that learning
halfspaces under distributional assumptions has already been addressed. For example,
(Kalai et al., 2005, Blais et al., 2008) show positive results under several assumptions on the
marginal distribution. However, to the best of our knowledge, these results yield runtime
which is exponential in $\mathrm{poly}(1/\epsilon)$, where $\epsilon$ is the excess error of the learnt hypothesis.

Open questions: Our analysis is limited in several ways. First, we assumed that the surrogate loss is fixed in advance, but we believe that results similar to ours hold even if the loss may depend on $\gamma$. This belief is supported by our results about the integrality gap. As explained in Section 6, this is a somewhat subtle issue and is related to questions about sample complexity. Second, we refer to $C$ as the complexity of Program (3) since the analysis of uniform convergence requires a sample of size $\mathrm{poly}(C)$ in order to solve the problem based on a finite sample. Our results do not rule out the possibility of choosing $C$ that is exponentially large in $1/\gamma$ and still using a polynomial sample. We believe that this approach is doomed to fail due to over-fitting. Third, in view of Theorems 3.3 and 3.4, we believe that the lower bound in Theorems 2.6 and 3.1 can be improved to depend on $\frac{1}{\gamma}$ rather than on $\frac{1}{\sqrt{\gamma}}$, as in Theorem 2.5.
5 Proofs



5.1 Background and Notation 

Here we introduce some notation and terminology to be used throughout. The $L^p$ norm corresponding to a measure $\mu$ is denoted $\|\cdot\|_{p,\mu}$. Also, $\mathbb{N} = \{1,2,\ldots\}$ and $\mathbb{N}_0 = \{0,1,2,\ldots\}$.

5.1.1 Reproducing Kernel Hilbert Spaces 

All the theorems we quote here are standard and can be found, e.g., in Chapter 2 of (Saitoh, 1988). Let $H$ be a Hilbert space of functions from a set $S$ to $\mathbb{C}$. Note that $H$ consists of functions and not of equivalence classes of functions. We say that $H$ is a reproducing kernel Hilbert space (RKHS for short) if, for every $x \in S$, the linear functional $f \mapsto f(x)$ is bounded.

A function $k : S\times S \to \mathbb{C}$ is a reproducing kernel (or just a kernel) if, for every $x_1,\ldots,x_n \in S$, the matrix $\{k(x_i,x_j)\}_{1\le i,j\le n}$ is positive semi-definite.
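The positive semi-definiteness condition can be spot-checked numerically. The following is a minimal sketch, assuming the Gaussian kernel as an illustrative example (it is not a kernel singled out by the text) and real coefficients; the helper names `gaussian_kernel`, `gram_matrix` and `quadratic_form` are hypothetical. It evaluates the quadratic form $\sum_{i,j} c_i c_j k(x_i,x_j)$ for random coefficient vectors, which is a necessary-condition test and not a proof.

```python
import math
import random

def gaussian_kernel(x, y, sigma=1.0):
    """k(x, y) = exp(-|x - y|^2 / (2 sigma^2)); an illustrative PSD kernel."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))

def gram_matrix(kernel, points):
    """Matrix {k(x_i, x_j)} for the given points."""
    return [[kernel(p, q) for q in points] for p in points]

def quadratic_form(gram, c):
    """sum_{i,j} c_i c_j K_ij for real coefficients c."""
    n = len(c)
    return sum(c[i] * c[j] * gram[i][j] for i in range(n) for j in range(n))

random.seed(0)
points = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(6)]
K = gram_matrix(gaussian_kernel, points)

# Smallest value of the quadratic form over 200 random coefficient vectors.
min_form = min(
    quadratic_form(K, [random.gauss(0, 1) for _ in points]) for _ in range(200)
)
```

A negative `min_form` (beyond rounding error) would certify that a candidate function is not a kernel; a non-negative value is only evidence in favour.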

Kernels and RKHSs are essentially synonymous: 

Theorem 5.1

1. For every kernel $k$ there exists a unique RKHS $H_k$ such that for every $y \in S$, $k(\cdot,y) \in H_k$ and $\forall f \in H_k,\ f(y) = \langle f(\cdot), k(\cdot,y)\rangle_{H_k}$.

2. A Hilbert space $H \subset \mathbb{C}^S$ is a RKHS if and only if there exists a kernel $k : S\times S \to \mathbb{R}$ such that $H = H_k$.

3. For every kernel $k$, $\overline{\mathrm{span}\{k(\cdot,y)\}_{y\in S}} = H_k$. Moreover,

$$\left\langle \sum_{i=1}^n \alpha_i k(\cdot,x_i), \sum_{i=1}^n \beta_i k(\cdot,y_i)\right\rangle_{H_k} = \sum_{1\le i,j\le n} \alpha_i\bar{\beta}_j k(y_j,x_i)$$

4. If the kernel $k : S\times S \to \mathbb{R}$ takes only real values, then $H_k^{\mathbb{R}} := \{\mathrm{Re}(f) : f \in H_k\} \subset H_k$. Moreover, $H_k^{\mathbb{R}}$ is a real Hilbert space with the inner product induced from $H_k$.

5. For every kernel $k$, convergence in $H_k$ implies point-wise convergence. If $\sup_{x\in S} k(x,x) < \infty$ then this convergence is uniform.

There is also a tight connection between embeddings of S into a Hilbert space and RKHSs. 

Theorem 5.2 A function $k : S\times S \to \mathbb{R}$ is a kernel iff there exists a mapping $\phi : S \to H$ to some real Hilbert space for which $k(x,y) = \langle \phi(y),\phi(x)\rangle_H$. Also,

$$H_k = \{f_v : v \in H\}$$

where $f_v(x) = \langle v,\phi(x)\rangle_H$. The mapping $v \mapsto f_v$, restricted to $\overline{\mathrm{span}}(\phi(S))$, is a Hilbert space isomorphism.

A kernel $k : S\times S\to\mathbb{R}$ is called normalized if $\sup_{x\in S} k(x,x) = 1$. Also,

Theorem 5.3 Let $k : S\times S\to\mathbb{R}$ be a kernel and let $\{f_n\}_{n=1}^{\infty}$ be an orthonormal basis of $H_k$. Then, $k(x,y) = \sum_{n=1}^{\infty} f_n(x) f_n(y)$.
5.1.2 Unitary Representations of Compact Groups

Proofs of the results stated here can be found in (Folland, 1994), chapter 5. Let $G$ be a compact group. A unitary representation (or just a representation) of $G$ is a group homomorphism $\rho : G \to U(H)$, where $U(H)$ is the class of unitary operators over a Hilbert space $H$, such that, for every $v \in H$, the mapping $a \mapsto \rho(a)v$ is continuous.

We say that a closed subspace $M \subset H$ is invariant (to $\rho$) if for every $a \in G$, $v \in M$, $\rho(a)v \in M$. We note that if $M$ is invariant then so is $M^{\perp}$. We denote by $\rho|_M : G \to U(M)$ the restriction of $\rho$ to $M$. That is, $\forall a \in G,\ \rho|_M(a) = \rho(a)|_M$. We say that $\rho : G \to U(H)$ is reducible if $H = M \oplus M^{\perp}$ such that $M, M^{\perp}$ are both nonzero closed and invariant subspaces of $H$. A basic result is that every representation of a compact group is a sum of irreducible representations.

Theorem 5.4 Let $\rho : G \to U(H)$ be a representation of a compact group $G$. Then, $H = \bigoplus_{n\in I} H_n$, where every $H_n$ is invariant to $\rho$ and $\rho|_{H_n}$ is irreducible.

We shall also use the following Lemma. 

Lemma 5.5 Let $G$ be a compact group, $V$ a finite dimensional vector space and let $\rho : G \to GL(V)$ be a continuous homomorphism of groups (here, $GL(V)$ is the group of invertible linear operators over $V$). Then,

1. There exists an inner product on $V$ making $\rho$ a unitary representation.

2. Moreover, if $V$ has no non-trivial invariant subspaces (here a subspace $U \subset V$ is called invariant if, $\forall a \in G,\ f \in U$, $\rho(a)f \in U$) then this inner product is unique up to scalar multiple.

5.1.3 Harmonic Analysis on the Sphere 

All the results stated here can be found in (Atkinson and Han, 2012), chapters 1 and 2. Denote by $O(d)$ the group of unitary operators over $\mathbb{R}^d$ and by $dA$ the uniform probability measure over $O(d)$ (that is, $dA$ is the unique probability measure satisfying $\int_{O(d)} f(A)dA = \int_{O(d)} f(AB)dA = \int_{O(d)} f(BA)dA$ for every $B \in O(d)$ and every integrable function $f : O(d) \to \mathbb{C}$). Denote by $dx = dx_{d-1}$ the Lebesgue (area) measure over $S^{d-1}$ and let $L^2(S^{d-1}) := L^2(S^{d-1},dx)$. Given a measurable set $Z \subset S^{d-1}$, we sometimes denote its Lebesgue measure by $|Z|$. Also, denote by $dm = \frac{dx}{|S^{d-1}|}$ the Lebesgue measure, normalized to be a probability measure.

For every $n \in \mathbb{N}_0$, we denote by $\mathbb{Y}_n^d$ the linear space of $d$-variable harmonic (i.e., satisfying $\Delta p = 0$) homogeneous polynomials of degree $n$. It holds that

$$N_{d,n} := \dim(\mathbb{Y}_n^d) = \binom{d+n-1}{d-1} - \binom{d+n-3}{d-1}$$

Denote by $\mathcal{P}_{d,n} : L^2(S^{d-1}) \to \mathbb{Y}_n^d$ the orthogonal projection onto $\mathbb{Y}_n^d$.

We denote by $\rho : O(d) \to U(L^2(S^{d-1}))$ the unitary representation defined by

$$\rho(A)f = f\circ A^{-1}$$

We say that a closed subspace $M \subset L^2(S^{d-1})$ is invariant if it is invariant w.r.t. $\rho$ (that is, $\forall f \in M,\ A \in O(d)$, $f\circ A \in M$). We say that an invariant space $M$ is primitive if $\rho|_M$ is irreducible.

Theorem 5.6

1. $L^2(S^{d-1}) = \bigoplus_{n=0}^{\infty} \mathbb{Y}_n^d$.

2. The primitive finite dimensional subspaces of $L^2(S^{d-1})$ are exactly $\{\mathbb{Y}_n^d\}_{n=0}^{\infty}$.

Lemma 5.7 Fix an orthonormal basis $Y_{n,j}^d$, $1 \le j \le N_{d,n}$, of $\mathbb{Y}_n^d$. For every $x \in S^{d-1}$ it holds that

$$\sum_{j=1}^{N_{d,n}} |Y_{n,j}^d(x)|^2 = \frac{N_{d,n}}{|S^{d-1}|}$$
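As a small sanity check on the dimensions $N_{d,n}$ of the spaces of spherical harmonics, the sketch below assumes the standard closed form $N_{d,n} = \binom{d+n-1}{d-1} - \binom{d+n-3}{d-1}$ (a textbook fact, stated here as an assumption rather than taken from the surrounding text); `harmonic_dim` is a hypothetical helper name.

```python
from math import comb

def harmonic_dim(d, n):
    """Dimension N_{d,n} of degree-n harmonic homogeneous polynomials in d >= 2 variables.

    Assumes the standard formula C(d+n-1, d-1) - C(d+n-3, d-1); the cases
    n = 0 and n = 1 are handled separately to avoid negative binomial arguments.
    """
    if n == 0:
        return 1
    if n == 1:
        return d
    return comb(d + n - 1, d - 1) - comb(d + n - 3, d - 1)

# On the circle (d = 2) every degree n >= 1 contributes cos(n t) and sin(n t).
circle_dims = [harmonic_dim(2, n) for n in range(1, 6)]
# On the 2-sphere (d = 3) the familiar count is 2n + 1.
sphere_dims = [harmonic_dim(3, n) for n in range(0, 5)]
```

Summing $N_{d,n-2j}$ over $j$ recovers the dimension of all homogeneous polynomials of degree $n$, which is a convenient consistency check on the formula.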



5.1.4 Legendre and Chebyshev Polynomials 

The results stated here can be found in (Atkinson and Han, 2012). Fix $d \ge 2$. The $d$-dimensional Legendre polynomials are the sequence of polynomials over $[-1,1]$ defined by the recursion formula

$$P_{d,n}(x) = \frac{2n+d-4}{n+d-3}\,x\,P_{d,n-1}(x) - \frac{n-1}{n+d-3}\,P_{d,n-2}(x)$$
$$P_{d,0} = 1,\quad P_{d,1}(x) = x$$

We shall make use of the following properties of the Legendre polynomials.

Proposition 5.8

1. For every $d \ge 2$, the sequence $\{P_{d,n}\}$ is an orthogonal basis of the Hilbert space $L^2\left([-1,1], (1-x^2)^{\frac{d-3}{2}}dx\right)$.

2. For every $n, d$, $\|P_{d,n}\|_{\infty} = 1$.
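The recursion can be evaluated directly. The following is a minimal sketch, assuming the three-term recursion $(n+d-3)P_{d,n}(x) = (2n+d-4)\,x\,P_{d,n-1}(x) - (n-1)P_{d,n-2}(x)$ with $P_{d,0}=1$, $P_{d,1}(x)=x$ (these are the coefficients used later in the proof of Lemma 5.13); `legendre_d` is a hypothetical helper name.

```python
import math

def legendre_d(d, n, x):
    """Evaluate the d-dimensional Legendre polynomial P_{d,n}(x) by the
    three-term recursion, starting from P_{d,0} = 1 and P_{d,1}(x) = x."""
    if n == 0:
        return 1.0
    prev, cur = 1.0, x
    for m in range(2, n + 1):
        nxt = ((2 * m + d - 4) * x * cur - (m - 1) * prev) / (m + d - 3)
        prev, cur = cur, nxt
    return cur

# d = 3 gives the classical Legendre polynomials: P_2(x) = (3 x^2 - 1) / 2.
classical_p2 = legendre_d(3, 2, 0.5)

# d = 2 gives the Chebyshev polynomials: T_n(cos t) = cos(n t).
chebyshev_t5 = legendre_d(2, 5, math.cos(0.3))

# Part 2 of Proposition 5.8: the sup norm on [-1, 1] equals 1 (attained at x = 1).
grid = [i / 200.0 for i in range(-200, 201)]
sup_norm = max(abs(legendre_d(5, 8, x)) for x in grid)
```

For moderate $n$ the forward recursion is numerically stable on $[-1,1]$, which is why no special care is taken here.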

The Chebyshev polynomials of the first kind are defined as $T_n := P_{2,n}$. The Chebyshev polynomials of the second kind are the polynomials over $[-1,1]$ defined by the recursion formula

$$U_n(x) = 2xU_{n-1}(x) - U_{n-2}(x)$$
$$U_0 = 1,\quad U_1(x) = 2x$$

We shall make use of the following properties of the Chebyshev polynomials.

Proposition 5.9

1. For every $n \ge 1$, $T_n' = nU_{n-1}$.

2. $\|U_n\|_{\infty} = n+1$.
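Both properties of Proposition 5.9 are easy to check numerically. The sketch below evaluates $T_n$ and $U_n$ by their recursions and compares a central finite difference of $T_n$ with $nU_{n-1}$; the helper names and the step size `h` are illustrative assumptions.

```python
def chebyshev_t(n, x):
    """First kind: T_0 = 1, T_1 = x, T_n = 2 x T_{n-1} - T_{n-2}."""
    if n == 0:
        return 1.0
    prev, cur = 1.0, x
    for _ in range(2, n + 1):
        prev, cur = cur, 2.0 * x * cur - prev
    return cur

def chebyshev_u(n, x):
    """Second kind: U_0 = 1, U_1 = 2x, U_n = 2 x U_{n-1} - U_{n-2}."""
    if n == 0:
        return 1.0
    prev, cur = 1.0, 2.0 * x
    for _ in range(2, n + 1):
        prev, cur = cur, 2.0 * x * cur - prev
    return cur

def central_difference(f, x, h=1e-6):
    """Numerical derivative of f at x (approximation, not exact)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

n, x0 = 5, 0.3
lhs = central_difference(lambda x: chebyshev_t(n, x), x0)   # approximates T_5'(0.3)
rhs = n * chebyshev_u(n - 1, x0)                            # 5 * U_4(0.3)

grid = [i / 500.0 for i in range(-500, 501)]
sup_u4 = max(abs(chebyshev_u(4, x)) for x in grid)          # attained at x = +-1
```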

Given a measure $\mu$ over $[-1,1]$, the orthogonal polynomials corresponding to $\mu$ are the sequence of polynomials obtained upon the Gram-Schmidt procedure applied to $1, x, x^2, x^3, \ldots$. We note that $1, \sqrt{2}T_1, \sqrt{2}T_2, \sqrt{2}T_3, \ldots$ are the orthogonal polynomials corresponding to the probability measure $d\mu = \frac{dx}{\pi\sqrt{1-x^2}}$.
5.1.5 Bochner Integral and Bochner Spaces

Proofs and elaborations on the material appearing in this section can be found in (Kosaku Yosida, 1963). Let $(X, m, \mu)$ be a measure space and let $H$ be a Hilbert space. A function $f : X \to H$ is (Bochner) measurable if there exists a sequence of functions $f_n : X \to H$ such that

• For almost every $x \in X$, $f(x) = \lim_{n\to\infty} f_n(x)$.

• The range of every $f_n$ is countable and, for every $v \in H$, $f_n^{-1}(v)$ is measurable.

A measurable function $f : X \to H$ is (Bochner) integrable if there exists a sequence of simple measurable functions (in the usual sense) $s_n$ such that $\lim_{n\to\infty}\int_X \|f(x) - s_n(x)\|_H\,d\mu(x) = 0$. We define the integral of $f$ to be $\int_X f\,d\mu = \lim_{n\to\infty}\int_X s_n\,d\mu$, where the integral of a simple function $s = \sum_{i=1}^n 1_{A_i}v_i$, $A_i \in m$, $v_i \in H$, is $\int_X s\,d\mu = \sum_{i=1}^n \mu(A_i)v_i$.

Define by $L^2(X,H)$ the Kolmogorov quotient (by equality almost everywhere) of all measurable functions $f : X \to H$ such that $\int_X \|f\|_H^2\,d\mu < \infty$.

Theorem 5.10 $L^2(X,H)$ is a Hilbert space w.r.t. the inner product $\langle f,g\rangle_{L^2(X,H)} = \int_X \langle f(x),g(x)\rangle_H\,d\mu(x)$.

5.2 Symmetric Kernels and Symmetrization 

In this section we consider symmetric kernels. Fix $d \ge 2$ and let $k : S^{d-1}\times S^{d-1} \to \mathbb{R}$ be a continuous positive definite kernel. We say that $k$ is symmetric if

$$\forall A \in O(d),\ x,y \in S^{d-1},\quad k(Ax,Ay) = k(x,y)$$

In other words, $k(x,y)$ depends only on $\langle x,y\rangle_{\mathbb{R}^d}$. A RKHS is called symmetric if its reproducing kernel is symmetric. The next theorem characterizes symmetric RKHSs. We note that theorems of the same spirit have already been proved (e.g. (Schoenberg, 1942)).

Theorem 5.11 Let $k : S^{d-1}\times S^{d-1} \to \mathbb{R}$ be a normalized, symmetric and continuous kernel. Then,

1. The group $O(d)$ acts on $H_k$. That is, for every $A \in O(d)$ and every $f \in H_k$ it holds that $f\circ A \in H_k$ and $\|f\|_{H_k} = \|f\circ A\|_{H_k}$.

2. The mapping $\rho : O(d) \to U(H_k)$ defined by $\rho(A)f = f\circ A^{-1}$ is a unitary representation.

3. The space $H_k$ consists of continuous functions.

4. The decomposition of $\rho$ into a sum of irreducible representations is $H_k = \bigoplus_{n\in I}\mathbb{Y}_n^d$ for some set $I \subset \mathbb{N}_0$. Moreover,

$$\forall f,g \in H_k,\quad \langle f,g\rangle_{H_k} = \sum_{n\in I} a_n^2\,\langle \mathcal{P}_{d,n}f, \mathcal{P}_{d,n}g\rangle_{L^2(S^{d-1})}$$

where $\{a_n\}_{n\in I}$ are positive numbers.

5. It holds that $\sum_{n\in I}\frac{N_{d,n}}{|S^{d-1}|}a_n^{-2} = 1$.
Proof Let $f \in H_k$, $A \in O(d)$. To prove part 1, assume first that

$$\forall x \in S^{d-1},\quad f(x) = \sum_{i=1}^n \alpha_i k(x,y_i) \qquad (9)$$

for some $y_1,\ldots,y_n \in S^{d-1}$ and $\alpha_1,\ldots,\alpha_n \in \mathbb{C}$. We have, since $k$ is symmetric, that

$$f\circ A(x) = \sum_{i=1}^n \alpha_i k(Ax,y_i) = \sum_{i=1}^n \alpha_i k(A^{-1}Ax, A^{-1}y_i) = \sum_{i=1}^n \alpha_i k(x, A^{-1}y_i)$$

Thus, by Theorem 5.1, $f\circ A \in H_k$. Moreover, it holds that

$$\|f\circ A\|_{H_k}^2 = \sum_{1\le i,j\le n}\alpha_i\bar{\alpha}_j k(A^{-1}y_j, A^{-1}y_i) = \sum_{1\le i,j\le n}\alpha_i\bar{\alpha}_j k(y_j,y_i) = \|f\|_{H_k}^2$$

Thus, part 1 holds for functions $f \in H_k$ of the form (9). For general $f \in H_k$, by Theorem 5.1, there is a sequence $f_n \in H_k$ of functions of the form (9) that converges to $f$ in $H_k$. From what we have shown for functions of the form (9) it follows that $\|f_n - f_m\|_{H_k} = \|f_n\circ A - f_m\circ A\|_{H_k}$, thus $f_n\circ A$ is a Cauchy sequence and hence has a limit $g \in H_k$. By Theorem 5.1, convergence in $H_k$ entails point-wise convergence, thus $g = f\circ A$. Finally,

$$\|f\|_{H_k} = \lim_{n\to\infty}\|f_n\|_{H_k} = \lim_{n\to\infty}\|f_n\circ A\|_{H_k} = \|f\circ A\|_{H_k}$$

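We proceed to part 2. It is not hard to check that $\rho$ is a group homomorphism, so it only remains to validate that for every $f \in H_k$ the mapping $A \mapsto \rho(A)f$ is continuous. Let $\epsilon > 0$ and let $A \in O(d)$. We must show that there exists a neighbourhood $U$ of $A$ such that $\forall B \in U$, $\|f\circ A^{-1} - f\circ B^{-1}\|_{H_k} < \epsilon$. Choose $g(\cdot) = \sum_{i=1}^n \alpha_i k(\cdot,y_i)$ such that $\|g - f\|_{H_k} < \frac{\epsilon}{3}$. By part 1, it holds that

$$\begin{aligned}
\|f\circ A^{-1} - f\circ B^{-1}\|_{H_k} &\le \|f\circ A^{-1} - g\circ A^{-1}\|_{H_k} + \|g\circ A^{-1} - g\circ B^{-1}\|_{H_k} + \|g\circ B^{-1} - f\circ B^{-1}\|_{H_k}\\
&= \|f - g\|_{H_k} + \|g\circ A^{-1} - g\circ B^{-1}\|_{H_k} + \|g - f\|_{H_k}\\
&< \frac{\epsilon}{3} + \|g\circ A^{-1} - g\circ B^{-1}\|_{H_k} + \frac{\epsilon}{3}
\end{aligned}$$

Thus, it is enough to find a neighbourhood $U$ of $A$ such that $\forall B \in U$, $\|g\circ A^{-1} - g\circ B^{-1}\|_{H_k} < \frac{\epsilon}{3}$. However,

$$\begin{aligned}
\|g\circ A^{-1} - g\circ B^{-1}\|_{H_k}^2 &= \|g\circ A^{-1}\|_{H_k}^2 + \|g\circ B^{-1}\|_{H_k}^2 - 2\mathrm{Re}\left\langle \sum_{i=1}^n \alpha_i k(\cdot,y_i)\circ A^{-1}, \sum_{i=1}^n \alpha_i k(\cdot,y_i)\circ B^{-1}\right\rangle_{H_k}\\
&= 2\|g\|_{H_k}^2 - 2\mathrm{Re}\left\langle \sum_{i=1}^n \alpha_i k(\cdot,Ay_i), \sum_{i=1}^n \alpha_i k(\cdot,By_i)\right\rangle_{H_k}\\
&= 2\|g\|_{H_k}^2 - 2\mathrm{Re}\sum_{1\le i,j\le n}\alpha_i\bar{\alpha}_j k(By_j,Ay_i)
\end{aligned}$$

Since $k$ is continuous, the last expression tends to $2\|g\circ A^{-1}\|_{H_k}^2 - 2\mathrm{Re}\langle g\circ A^{-1}, g\circ A^{-1}\rangle_{H_k} = \|g\circ A^{-1} - g\circ A^{-1}\|_{H_k}^2 = 0$ as $B \to A$. Thus, there exists a neighbourhood $U$ such that $\forall B \in U$, $\|g\circ A^{-1} - g\circ B^{-1}\|_{H_k} < \frac{\epsilon}{3}$, as required.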

To see part 3, note that every function in $H_k$ is a limit in $H_k$ of functions of the form (9). Since $k$ is continuous, every function in $H_k$ is a limit in $H_k$ of continuous functions. However, by Theorem 5.1, every function in $H_k$ is in fact a uniform limit of continuous functions, and is therefore continuous itself.

We proceed to part 4. By Theorem 5.4, $H_k = \bigoplus_{i\in I} V_i$, where each $V_i$ is a finite dimensional space that is invariant to $\rho$. By Theorem 5.6, each $V_i$ must be $\mathbb{Y}_n^d$ for some $n$, thus $H_k = \bigoplus_{n\in I}\mathbb{Y}_n^d$. By the uniqueness part in Lemma 5.5 and Theorem 5.6, the restriction of $\langle\cdot,\cdot\rangle_{H_k}$ to each $\mathbb{Y}_n^d$, $n \in I$, equals $\langle\cdot,\cdot\rangle_{L^2(S^{d-1})}$ up to a scalar multiple, proving the formula for $\langle\cdot,\cdot\rangle_{H_k}$.

Finally, to see part 5, note that if for every $n \in I$, $\{Y_{n,j}^d\}_{j\in[N_{d,n}]}$ is an orthonormal basis of $\mathbb{Y}_n^d$ w.r.t. $\langle\cdot,\cdot\rangle_{L^2(S^{d-1})}$, then $\{a_n^{-1}Y_{n,j}^d\}_{n\in I, j\in[N_{d,n}]}$ is an orthonormal basis of $H_k$. By Theorem 5.3 and Lemma 5.7, it follows that, for every $x \in S^{d-1}$,

$$k(x,x) = \sum_{n\in I} a_n^{-2}\sum_{j=1}^{N_{d,n}}\left(Y_{n,j}^d(x)\right)^2 = \sum_{n\in I}\frac{N_{d,n}}{|S^{d-1}|}a_n^{-2}$$

Since $k$ is symmetric, $k(x,x)$ does not depend on $x$, and since $k$ is normalized this common value is 1.

□



Symmetrization 

Let $k : S^{d-1}\times S^{d-1} \to \mathbb{R}$ be a normalized continuous kernel. We define its symmetrization by

$$\forall x,y \in S^{d-1},\quad k_s(x,y) = \int_{O(d)} k(Ax,Ay)\,dA$$

Theorem 5.12

1. $k_s$ is a symmetric continuous kernel with $\sup_{x\in S^{d-1}} k_s(x,x) \le 1$.

2. For every $\Phi \in L^2(O(d),H_k)$ define $\bar{\Phi} : S^{d-1} \to \mathbb{C}$ by $\bar{\Phi}(x) = \int_{O(d)}\Phi(A)(Ax)\,dA$. Then

$$H_{k_s} = \{\bar{\Phi} : \Phi \in L^2(O(d),H_k)\}$$

Moreover, for every $\Phi \in L^2(O(d),H_k)$, $\|\bar{\Phi}\|_{H_{k_s}} \le \|\Phi\|_{L^2(O(d),H_k)}$.
Proof Part 1 follows readily from the definition. We proceed to part 2. Define $\phi : S^{d-1} \to L^2(O(d),H_k)$ by

$$\phi(x)(A)(\cdot) = k(Ax,\cdot)$$

Note that

$$\langle\phi(x),\phi(y)\rangle_{L^2(O(d),H_k)} = \int_{O(d)}\langle\phi(x)(A),\phi(y)(A)\rangle_{H_k}\,dA = \int_{O(d)} k(Ax,Ay)\,dA = k_s(x,y)$$

Thus, the Theorem follows from Theorem 5.2.

□
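To make the symmetrization concrete, here is a minimal sketch for $d = 2$, where $O(2)$ consists of rotations and rotations composed with one reflection. The kernel $k(x,y) = (1 + x_1y_1)/2$ is an illustrative choice of a normalized, non-symmetric kernel (an assumption, not an example from the text), and all helper names are hypothetical. For this kernel one can check by hand that $k_s(x,y) = \frac{1}{2} + \frac{1}{4}\langle x,y\rangle$, so $k_s$ depends only on $\langle x,y\rangle$ and $k_s(x,x) = \frac{3}{4} \le 1$.

```python
import math

def k(x, y):
    """Illustrative normalized, non-symmetric kernel on the circle."""
    return 0.5 * (1.0 + x[0] * y[0])

def rotate(phi, x):
    c, s = math.cos(phi), math.sin(phi)
    return (c * x[0] - s * x[1], s * x[0] + c * x[1])

def reflect(x):
    return (x[0], -x[1])

def symmetrize(kernel, x, y, n_angles=64):
    """Average kernel(Ax, Ay) over O(2), using an equally spaced grid of
    rotation angles on both components (with and without a reflection)."""
    total = 0.0
    for j in range(n_angles):
        phi = 2.0 * math.pi * j / n_angles
        total += kernel(rotate(phi, x), rotate(phi, y))
        total += kernel(rotate(phi, reflect(x)), rotate(phi, reflect(y)))
    return total / (2 * n_angles)

def on_circle(theta):
    return (math.cos(theta), math.sin(theta))

x, y = on_circle(0.2), on_circle(1.1)
ks_xy = symmetrize(k, x, y)
ks_xx = symmetrize(k, x, x)
# Applying a common rotation to both arguments should not change k_s.
ks_rotated = symmetrize(k, rotate(0.7, x), rotate(0.7, y))
```

The equally spaced grid is exact here because the integrand is a trigonometric polynomial of degree 2 in the rotation angle; for general kernels it is only an approximation of the Haar average.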
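5.3 Lemma 5.16 and its proof

Lemma 5.13 For every $n \ge 0$, $d \ge 5$ and $t \in [-1,1]$ it holds that

$$|P_{d,n}(t)| \le \min\left\{\frac{\Gamma\left(\frac{d-1}{2}\right)}{\sqrt{\pi}}\left[\frac{4}{n(1-t^2)}\right]^{\frac{d-2}{2}},\ 1\right\}$$

Moreover, if $\frac{n}{n+d-2} + 2|t| \le 1$ we also have

$$|P_{d,n}(t)| \le \sqrt{\prod_{i=1}^n\left(\frac{i}{i+d-2} + 2|t|\right)}$$

Finally, there exist constants $E > 0$ and $0 < r,s < 1$ such that for every $K > 0$, $d \ge 5$ and $t \in \left[-\frac{1}{8},\frac{1}{8}\right]$ we have

$$\sum_{n=K}^{\infty}|P_{d,n}(t)| \le E\cdot(r^K + s^d)$$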



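Proof In (Atkinson and Han, 2012) it is shown that $|P_{d,n}(t)| \le \frac{\Gamma\left(\frac{d-1}{2}\right)}{\sqrt{\pi}}\left[\frac{4}{n(1-t^2)}\right]^{\frac{d-2}{2}}$. We shall prove, by induction on $n$, that

$$|P_{d,n}(t)| \le \sqrt{\prod_{i=1}^n\left(\frac{i}{i+d-2} + 2|t|\right)}$$

whenever $\frac{n}{n+d-2} + 2|t| \le 1$. For $n = 0,1$ it follows from the fact that $P_{d,0} = 1$ and $P_{d,1}(t) = t$. Let $n > 1$. By the induction hypothesis and the recursion formula for the Legendre polynomials we have

$$\begin{aligned}
|P_{d,n}(t)| &\le \frac{2n+d-4}{n+d-3}|t|\,|P_{d,n-1}(t)| + \frac{n-1}{n+d-3}|P_{d,n-2}(t)|\\
&\le 2|t|\,|P_{d,n-1}(t)| + \frac{n-1}{n+d-3}|P_{d,n-2}(t)|\\
&\le 2|t|\sqrt{\prod_{i=1}^{n-1}\left(\frac{i}{i+d-2} + 2|t|\right)} + \frac{n-1}{n+d-3}\sqrt{\prod_{i=1}^{n-2}\left(\frac{i}{i+d-2} + 2|t|\right)}\\
&\le \left(2|t| + \frac{n-1}{n+d-3}\right)\sqrt{\prod_{i=1}^{n-2}\left(\frac{i}{i+d-2} + 2|t|\right)}\\
&\le \sqrt{\left(2|t| + \frac{n-1}{n+d-3}\right)\left(2|t| + \frac{n}{n+d-2}\right)}\sqrt{\prod_{i=1}^{n-2}\left(\frac{i}{i+d-2} + 2|t|\right)}\\
&= \sqrt{\prod_{i=1}^{n}\left(\frac{i}{i+d-2} + 2|t|\right)}
\end{aligned}$$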



Now, every K, K > such that 



K 



K+d-2 



n=K 



< 



< 



n=K 
K 



+ 2\t\ ) < 1, we have 



K 



n=K 

oo 



n=K+l 



d-2 



r(fci) 



s E 



n=K 



K 



K + d-2 



+ 2\t\ 



n=K+l 

1 , r(¥) 



7T 



(1-^) 



n(l -t 2 )_ 
4 

n(l-t 2 ) 

- oo 

E 



d-2 

n 2 



n=ftT+l 



(*fe + 2|'l) T r(^) 

l ~r 



A' 



A"+d-2 



2|f| 



l 21 

K+d-2 ~ I 



r 



7T 



'd-r 



7T 



(t^ + 2W) t r(^) 



K+d-2 



T + 



K _|_ ~ ' > ; - 



7T 



(1-t 2 ) 



(1-t 2 ) 



E 

n=AM-l 



d-2 

n 2 



X 2 dx 



K 



d=2 d _ 4 
2 K 2" 



2 
(We limit ourselves to $d \ge 5$ to guarantee the convergence of $\sum_n n^{-\frac{d-2}{2}}$.) In particular, if $|t| \le \frac{1}{8}$ and $K = d-2$, we have,



n=K 



< 




+ 



+ 



+ 



r(¥) 



7T 



4.07 

(d = 2) 

4.07 

(d = 2) 



d-2 

2 



4.07 

(d = 2) 



d-2 

2tt /d — 2^ 2 



d-2 
2 



2e 







"4.07" 




2 +12 






. 2e . 



d-2 
2 



□ 



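Lemma 5.14 Let $\mu$ be a probability measure on $[-1,1]$ and let $p_0,p_1,\ldots$ be the corresponding orthogonal polynomials. Then, for every $f \in \mathrm{span}\{p_0,\ldots,p_{K-1}\}$ we have

$$\|f\|_2 \le \sqrt{K}\cdot\|f\|_1\cdot\max_{0\le i\le K-1}\|p_i\|_{\infty}$$

Here, all $L^p$ norms are w.r.t. $\mu$.

Proof Write $f = \sum_{i=0}^{K-1}\alpha_i p_i$ and denote $M = \max_{0\le i\le K-1}\|p_i\|_{\infty}$. We have

$$\begin{aligned}
\|f\|_2^2 &\le \|f\|_1\cdot\|f\|_{\infty}\\
&\le \|f\|_1\cdot M\cdot\sum_{n=0}^{K-1}|\alpha_n|\\
&\le \|f\|_1\cdot M\cdot\sqrt{K}\cdot\sqrt{\sum_{n=0}^{K-1}\alpha_n^2}\\
&= \|f\|_1\cdot M\cdot\|f\|_2\cdot\sqrt{K}
\end{aligned}$$

□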
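Lemma 5.14 can be checked numerically on a small example. The sketch below assumes the Chebyshev measure $d\mu = \frac{dx}{\pi\sqrt{1-x^2}}$, whose orthonormal polynomials are $1, \sqrt{2}T_1, \sqrt{2}T_2, \ldots$, and approximates $\mu$-integrals with the Gauss-Chebyshev rule (an average over Chebyshev nodes). The coefficients of $f$ and all helper names are illustrative assumptions.

```python
import math

def cheb_t(n, x):
    """Chebyshev polynomial of the first kind via T_n(x) = cos(n arccos x)."""
    return math.cos(n * math.acos(x))

def orthonormal_p(i, x):
    """Orthonormal polynomials for the Chebyshev measure: 1, sqrt(2) T_1, sqrt(2) T_2, ..."""
    return 1.0 if i == 0 else math.sqrt(2.0) * cheb_t(i, x)

def integrate(h, n_nodes=2000):
    """Gauss-Chebyshev rule: the mu-integral of h is approximated by the mean of h
    over the nodes cos((2j - 1) pi / (2 n_nodes)); exact for polynomials of degree
    below 2 * n_nodes, approximate for non-smooth integrands such as |f|."""
    nodes = (math.cos((2 * j - 1) * math.pi / (2 * n_nodes))
             for j in range(1, n_nodes + 1))
    return sum(h(x) for x in nodes) / n_nodes

coeffs = [1.0, 0.5, -2.0]          # f = p_0 + 0.5 p_1 - 2 p_2, so K = 3
K = len(coeffs)

def f(x):
    return sum(a * orthonormal_p(i, x) for i, a in enumerate(coeffs))

norm1 = integrate(lambda x: abs(f(x)))
norm2 = math.sqrt(integrate(lambda x: f(x) ** 2))
M = math.sqrt(2.0)                  # max sup-norm of p_0, p_1, p_2
bound = math.sqrt(K) * norm1 * M
```

By orthonormality, `norm2` should equal the Euclidean norm of `coeffs`, and the lemma predicts `norm2 <= bound`.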



Lemma 5.15 Let $d \ge 5$ and let $f : [-1,1] \to \mathbb{R}$ be a continuous function whose expansion in the basis of $d$-dimensional Legendre polynomials is

$$f = \sum_{n=0}^{\infty}\alpha_n P_{d,n}$$

Denote $C = \sup_n|\alpha_n|$. Let $\mu$ be the probability measure on $[-1,1]$ whose density function is

$$w(x) = \begin{cases} 0 & |x| > \frac{1}{8}\\ \frac{8}{\pi\sqrt{1-(8x)^2}} & |x| \le \frac{1}{8}\end{cases}$$

Then, for every $K \in \mathbb{N}$, $\frac{1}{8} > \gamma > 0$,

$$|f(\gamma) - f(-\gamma)| \le 32\gamma K^{3.5}\cdot\|f\|_{1,\mu} + (32\gamma K^{3.5} + 2)\cdot C\cdot E\cdot(r^K + s^d)$$

Here, $E$, $r$ and $s$ are the constants from Lemma 5.13.

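Proof Let $\bar{f} = \sum_{n=0}^{K-1}\alpha_n P_{d,n}$. We have $\|\bar{f}\|_{1,\mu} \le \|f\|_{1,\mu} + \|f - \bar{f}\|_{\infty,\mu}$. Define $g : [-1,1] \to \mathbb{R}$ by $g(x) = \bar{f}\left(\frac{x}{8}\right)$ and denote $d\lambda = \frac{dx}{\pi\sqrt{1-x^2}}$. Write

$$g = \sum_{n=0}^{K-1}\beta_n T_n$$

where $T_n$ are the Chebyshev polynomials. By Lemma 5.14 it holds that, for every $0 \le n \le K-1$,

$$|\beta_n| \le \sqrt{2}\|g\|_{2,\lambda} \le 2\sqrt{K}\|g\|_{1,\lambda} = 2\sqrt{K}\|\bar{f}\|_{1,\mu}$$

Now,

$$g' = \sum_{n=1}^{K-1}\beta_n nU_{n-1}$$

where $U_n$ are the Chebyshev polynomials of the second kind. Thus,

$$\|g'\|_{\infty,\lambda} \le \sum_{n=1}^{K-1}|\beta_n|\cdot n\cdot\|U_{n-1}\|_{\infty,\lambda} = \sum_{n=1}^{K-1}|\beta_n|\cdot n^2 \le 2\|\bar{f}\|_{1,\mu}\cdot K^{3.5}$$

Finally, by Lemma 5.13,

$$\begin{aligned}
|f(\gamma) - f(-\gamma)| &\le |\bar{f}(\gamma) - \bar{f}(-\gamma)| + 2\|f - \bar{f}\|_{\infty,\mu}\\
&= |g(8\gamma) - g(-8\gamma)| + 2\|f - \bar{f}\|_{\infty,\mu}\\
&\le 32\gamma K^{3.5}\cdot\|\bar{f}\|_{1,\mu} + 2\|f - \bar{f}\|_{\infty,\mu}\\
&\le 32\gamma K^{3.5}\cdot\left(\|f\|_{1,\mu} + \|f - \bar{f}\|_{\infty,\mu}\right) + 2\|f - \bar{f}\|_{\infty,\mu}\\
&= 32\gamma K^{3.5}\cdot\|f\|_{1,\mu} + (32\gamma K^{3.5} + 2)\cdot\|f - \bar{f}\|_{\infty,\mu}\\
&\le 32\gamma K^{3.5}\cdot\|f\|_{1,\mu} + (32\gamma K^{3.5} + 2)\cdot E\cdot C\cdot(r^K + s^d)
\end{aligned}$$

□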

For $e \in S^{d-1}$ we define the group $O(e) := \{A \in O(d) : Ae = e\}$. If $H_k$ is a symmetric RKHS and $e \in S^{d-1}$ we define symmetrization around $e$. This is the operator $\mathcal{P}_e : H_k \to H_k$ which is the projection on the subspace $\{f \in H_k : \forall A \in O(e),\ f\circ A = f\}$. It is not hard to see that $(\mathcal{P}_e f)(x) = \int_{\{x' : \langle x',e\rangle = \langle x,e\rangle\}} f(x')dx' = \int_{O(e)} f\circ A(x)\,dA$. Since $\mathcal{P}_e f$ is a convex combination of the functions $\{f\circ A\}_{A\in O(e)}$, it follows that if $\mathcal{R} : H_k \to \mathbb{R}$ is a convex functional then $\mathcal{R}(\mathcal{P}_e f) \le \int_{O(e)}\mathcal{R}(f\circ A)\,dA$.

Lemma 5.16 (main) There exists a probability measure $\mu$ on $[-1,1]$ with the following properties. For every continuous and normalized kernel $k : S^{d-1}\times S^{d-1} \to \mathbb{R}$ and $C > 0$, there exists $e \in S^{d-1}$ such that, for every $f \in H_k$ with $\|f\|_{H_k} \le C$, $K \in \mathbb{N}$ and $0 < \gamma < \frac{1}{8}$,

$$\begin{aligned}
\int_{\{x:\langle x,e\rangle=\gamma\}} f - \int_{\{x:\langle x,e\rangle=-\gamma\}} f &\le 32\gamma K^{3.5}\cdot\|f\|_{1,\mu_e} + (32\gamma K^{3.5} + 2)\cdot E\cdot C\cdot(r^K + s^d)\\
&\le 32\gamma K^{3.5}\cdot\|f\|_{1,\mu_e} + 10\cdot E\cdot K^{3.5}\cdot C\cdot(r^K + s^d)
\end{aligned}$$

The integrals are w.r.t. the uniform probability over $\{x : \langle x,e\rangle = \gamma\}$ and $\{x : \langle x,e\rangle = -\gamma\}$, and $E, r, s$ are the constants from Lemma 5.13.

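Proof Suppose first that $k$ is symmetric. Let $\mu$ be the distribution over $[-1,1]$ whose density function is

$$w(x) = \begin{cases} 0 & |x| > \frac{1}{8}\\ \frac{8}{\pi\sqrt{1-(8x)^2}} & |x| \le \frac{1}{8}\end{cases}$$

We can assume that $f$ is $O(e)$-invariant. Otherwise, we can replace $f$ with $\mathcal{P}_e f$, which does not change the l.h.s. and does not increase the r.h.s. This assumption yields (see (Atkinson and Han, 2012), pages 17-18)

$$f(x) = \sum_{n=0}^{\infty}\alpha_n P_{d,n}(\langle e,x\rangle).$$

The $L^2(S^{d-1})$-norm of the map $x \mapsto P_{d,n}(\langle x,e\rangle)$ is $\sqrt{\frac{|S^{d-1}|}{N_{d,n}}}$ (e.g. (Atkinson and Han, 2012), page 71). Therefore,

$$\|f\|_{H_k}^2 = \sum_{n\in I} a_n^2|\alpha_n|^2\frac{|S^{d-1}|}{N_{d,n}}$$

where $\{a_n\}_{n\in I}$ are the numbers corresponding to $H_k$ from Theorem 5.11. In particular (since also for $n \notin I$, $\alpha_n = 0$),

$$|\alpha_n|^2 \le \frac{N_{d,n}}{|S^{d-1}|}a_n^{-2}\|f\|_{H_k}^2 \le \|f\|_{H_k}^2$$

Write

$$g(t) = \sum_{n=0}^{\infty}\alpha_n P_{d,n}(t),\quad t \in [-1,1],$$

so that $f(x) = g(\langle e,x\rangle)$. By Lemma 5.15,

$$|g(\gamma) - g(-\gamma)| \le 32\gamma K^{3.5}\cdot\|f\|_{1,\mu_e} + (32\gamma K^{3.5} + 2)\cdot E\cdot C\cdot(r^K + s^d)$$

Finally, $\int_{\{x:\langle x,e\rangle=\gamma\}} f = g(\gamma)$ and $\int_{\{x:\langle x,e\rangle=-\gamma\}} f = g(-\gamma)$ since $f$ is $O(e)$-invariant. The Lemma follows.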

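We proceed to the general case where $k$ is not necessarily symmetric. Assume by way of contradiction that for every $e \in S^{d-1}$, there exists a function $f_e$ such that

$$\int_{\{x:\langle x,e\rangle=\gamma\}} f_e - \int_{\{x:\langle x,e\rangle=-\gamma\}} f_e > 32\gamma K^{3.5}\cdot\|f_e\|_{1,\mu_e} + (32\gamma K^{3.5} + 2)\cdot E\cdot C\cdot(r^K + s^d) \qquad (10)$$

For convenience we normalize, so l.h.s. equals 1. Fix a vector $e_0 \in S^{d-1}$. Define $\Phi \in L^2(O(d),H_k)$ by

$$\Phi(A) = f_{Ae_0}$$

and let $f \in H_{k_s}$ be the function

$$f(x) = \int_{O(d)}\Phi(A)(Ax)\,dA = \int_{O(d)} f_{Ae_0}(Ax)\,dA$$

Now, it holds that

$$\begin{aligned}
\int_{\{x:\langle x,e_0\rangle=\gamma\}} f - \int_{\{x:\langle x,e_0\rangle=-\gamma\}} f &= \int_{\{x:\langle x,e_0\rangle=\gamma\}}\int_{O(d)} f_{Ae_0}(Ax)\,dA\,dx - \int_{\{x:\langle x,e_0\rangle=-\gamma\}}\int_{O(d)} f_{Ae_0}(Ax)\,dA\,dx\\
&= \int_{O(d)}\left[\int_{\{x:\langle x,e_0\rangle=\gamma\}} f_{Ae_0}(Ax)\,dx - \int_{\{x:\langle x,e_0\rangle=-\gamma\}} f_{Ae_0}(Ax)\,dx\right]dA\\
&= \int_{O(d)}\left[\int_{\{x:\langle x,Ae_0\rangle=\gamma\}} f_{Ae_0}(x)\,dx - \int_{\{x:\langle x,Ae_0\rangle=-\gamma\}} f_{Ae_0}(x)\,dx\right]dA = 1
\end{aligned}$$

On the other hand

$$\begin{aligned}
\|f\|_{1,\mu_{e_0}} &= \int_{S^{d-1}}\left|\int_{O(d)} f_{Ae_0}(Ax)\,dA\right|d\mu_{e_0}(x)\\
&\le \int_{O(d)}\int_{S^{d-1}}|f_{Ae_0}(Ax)|\,d\mu_{e_0}(x)\,dA\\
&= \int_{O(d)}\int_{S^{d-1}}|f_{Ae_0}(x)|\,d\mu_{Ae_0}(x)\,dA\\
&= \int_{O(d)}\|f_{Ae_0}\|_{1,\mu_{Ae_0}}\,dA
\end{aligned}$$

Moreover, by Theorem 5.12,

$$\|f\|_{H_{k_s}}^2 \le \|\Phi\|_{L^2(O(d),H_k)}^2 = \int_{O(d)}\|f_{Ae_0}\|_{H_k}^2\,dA \le C^2$$

Since the Lemma is already proved for symmetric kernels, it follows that

$$\begin{aligned}
1 &\le 32\gamma K^{3.5}\cdot\|f\|_{1,\mu_{e_0}} + (32\gamma K^{3.5} + 2)\cdot E\cdot C\cdot(r^K + s^d)\\
&\le 32\gamma K^{3.5}\cdot\int_{O(d)}\|f_{Ae_0}\|_{1,\mu_{Ae_0}}\,dA + (32\gamma K^{3.5} + 2)\cdot E\cdot C\cdot(r^K + s^d)\\
&= \int_{O(d)}\left[32\gamma K^{3.5}\cdot\|f_{Ae_0}\|_{1,\mu_{Ae_0}} + (32\gamma K^{3.5} + 2)\cdot E\cdot C\cdot(r^K + s^d)\right]dA
\end{aligned}$$

Thus, for some $A \in O(d)$,

$$1 \le 32\gamma K^{3.5}\cdot\|f_{Ae_0}\|_{1,\mu_{Ae_0}} + (32\gamma K^{3.5} + 2)\cdot E\cdot C\cdot(r^K + s^d)$$

contradicting Equation (10).

□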



5.4 Proofs of the main Theorems 

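We are now ready to prove Theorems 2.5 and 3.2. We only consider distributions that are supported on the unit sphere, and we can therefore assume that the problem is formulated in terms of the unit sphere and not the unit ball. Also, we reformulate program (5) as follows: Given a convex surrogate $l : \mathbb{R} \to \mathbb{R}$, a constant $C > 0$ and a continuous kernel $k : S^{\infty}\times S^{\infty} \to \mathbb{R}$ with $\sup_{x\in S^{\infty}} k(x,x) \le 1$, we want to solve

$$\begin{aligned}
\min\quad & \mathrm{Err}_{\mathcal{D},l}(f + b)\\
\text{s.t.}\quad & f \in H_k,\ b \in \mathbb{R} \qquad\qquad (11)\\
& \|f\|_{H_k} \le C
\end{aligned}$$

We can assume that $\partial_+ l(0) < 0$, for otherwise the approximation ratio is $\infty$. To see that, let the distribution $\mathcal{D}$ be concentrated on a single point on the sphere and always return the label 1. Of course, $\mathrm{Err}_{\gamma}(\mathcal{D}) = 0$. However, if $\partial_+ l(0) \ge 0$, it is not hard to see that if $f, b$ is the solution of program (11), then $f(x) + b \le 0$, so that $\mathrm{Err}_{\mathcal{D},0\text{-}1}(f + b) = 1$.

Lemma 5.17 Let $l$ be a surrogate loss, $\mu$ a probability measure on $S^{d-1}$ and $f \in C(S^{d-1})$. Let $\hat{\mu}$ be the probability measure on $S^{d-1}\times\{\pm 1\}$ which is the product measure of $\mu$ and the uniform distribution on $\{\pm 1\}$. Then

$$\|f\|_{1,\mu} \le \frac{2}{|\partial_+ l(0)|}\,\mathrm{Err}_{\hat{\mu},l}(f)$$

Proof By Jensen's inequality, it holds that

$$\begin{aligned}
\mathrm{Err}_{\hat{\mu},l}(f) &= \mathbb{E}_{(x,y)\sim\hat{\mu}}\,l(y\cdot f(x))\\
&= \frac{1}{2}\mathbb{E}_{x\sim\mu}\left[l(f(x)) + l(-f(x))\right]\\
&\ge \frac{1}{2}\mathbb{E}_{x\sim\mu}\,l(-|f(x)|)\\
&\ge \frac{1}{2}\,l\left(-\mathbb{E}_{x\sim\mu}|f(x)|\right)
\end{aligned}$$

It follows that $l(-\|f\|_{1,\mu}) \le 2\,\mathrm{Err}_{\hat{\mu},l}(f)$. By the convexity of $l$, it follows that for every $x \in \mathbb{R}$, $l(x) \ge l(0) + x\cdot\partial_+ l(0) = l(0) - x\cdot|\partial_+ l(0)| \ge -x\cdot|\partial_+ l(0)|$. Thus,

$$\|f\|_{1,\mu} \le \frac{2}{|\partial_+ l(0)|}\,\mathrm{Err}_{\hat{\mu},l}(f)$$

□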
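The inequality of Lemma 5.17 can be illustrated on a finite example. The following sketch assumes the hinge loss $l(z) = \max(0, 1-z)$, whose right derivative at 0 is $-1$, and a discrete measure $\mu$ given by hypothetical function values and weights; none of these specific numbers come from the text.

```python
def hinge(z):
    """Hinge loss, a convex surrogate with right derivative -1 at 0."""
    return max(0.0, 1.0 - z)

RIGHT_DERIVATIVE_AT_0 = -1.0

def l1_norm(values, weights):
    """||f||_{1,mu} for a discrete measure mu with the given weights."""
    return sum(w * abs(v) for v, w in zip(values, weights))

def surrogate_error(values, weights, loss):
    """Err under the product of mu and uniform labels:
    E_x [ (loss(f(x)) + loss(-f(x))) / 2 ]."""
    return sum(w * 0.5 * (loss(v) + loss(-v)) for v, w in zip(values, weights))

values = [-3.0, -0.4, 0.0, 0.7, 2.5]      # f evaluated on five support points
weights = [0.1, 0.3, 0.2, 0.25, 0.15]     # mu, sums to 1

lhs = l1_norm(values, weights)
rhs = 2.0 / abs(RIGHT_DERIVATIVE_AT_0) * surrogate_error(values, weights, hinge)
```

With uniformly random labels no predictor can do well, so a small surrogate error forces $f$ to be small in $L^1(\mu)$; the example shows the bound holding with room to spare.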



5.4.1 Theorems 2.5 and 3.2 

We will need Levy's measure concentration Lemma (e.g., (Milman and Schechtman, 2002)). Let $f : X \to Y$ be an absolutely continuous map between metric spaces. We define its modulus of continuity as

$$\forall\epsilon > 0,\quad \omega_f(\epsilon) = \sup\{d(f(x),f(y)) : x,y \in X,\ d(x,y) \le \epsilon\}$$

Theorem 5.18 (Levy's Lemma) There exists a constant $\eta > 0$ such that for every continuous function $f : S^{d-1} \to \mathbb{R}$,

$$\Pr\left(|f - \mathbb{E}f| > \omega_f(\epsilon)\right) \le \exp\left(-\eta d\epsilon^2\right)$$

Here, both probability and expectation are w.r.t. the uniform distribution.

We note that $\omega_{f\circ g} \le \omega_f\circ\omega_g$ and that $\omega_{\Lambda_{v,b}}(\epsilon) = \|v\|\cdot\epsilon$. Thus, if $\psi : S^{\infty} \to H_1$ is an absolutely continuous embedding such that $k(x,y) = \langle\psi(x),\psi(y)\rangle_{H_1}$, then for every $v \in H_1$, it holds that $\omega_{\Lambda_{v,0}\circ\psi} \le \|v\|_{H_1}\cdot\omega_{\psi}$. Suppose now that $f \in H_k$ with $\|f\|_{H_k} \le C$. Let $v \in H_1$ be such that $f = \Lambda_{v,0}\circ\psi$ and $\|v\|_{H_1} = \|f\|_{H_k} \le C$. It follows from Levy's Lemma that

$$\Pr\left(|f - \mathbb{E}f| > C\cdot\omega_{\psi}(\epsilon)\right) \le \Pr\left(|f - \mathbb{E}f| > \omega_f(\epsilon)\right) \le \exp\left(-\eta d\epsilon^2\right) \qquad (12)$$

Again, both probability and expectation are w.r.t. the uniform distribution over $S^{d-1}$.
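Proof (of Theorems 2.5 and 3.2) Let $\beta > \alpha > 0$ be such that $l(\alpha) > l(\beta)$. Choose $0 < \theta < 1$ large enough so that $(1-\theta)l(-\beta) + \theta l(\beta) < \theta l(\alpha)$. Define probability measures $\mu^1, \mu^2, \mu$ over $[-1,1]\times\{\pm 1\}$ as follows.

$$\mu^1((-\gamma,-1)) = 1-\theta,\quad \mu^1((\gamma,1)) = \theta$$

The measure $\mu^2$ is the product of $\mathrm{uniform}\{\pm 1\}$ and the measure on $[-1,1]$ whose density function is

$$w(x) = \begin{cases} 0 & |x| > \frac{1}{8}\\ \frac{8}{\pi\sqrt{1-(8x)^2}} & |x| \le \frac{1}{8}\end{cases}$$

Finally, $\mu = (1-\lambda)\mu^1 + \lambda\mu^2$ for $\lambda > 0$, which will be chosen later.

Let $e \in S^{d-1}$ be the vector from Lemma 5.16. The distribution $\mathcal{D}$ is the pullback of $\mu$ w.r.t. $e$. By considering the affine functional $\Lambda_{e,0}$, it holds that $\mathrm{Err}_{\gamma}(\mathcal{D}) \le \lambda$.

Let $g$ be the solution returned by the algorithm. With probability $\ge 1 - \exp(-1/\gamma)$, $g = f + b$, where $f, b$ is a solution to program (11) with $C = C_{\mathcal{A}}(\gamma)$ and with an additive error $\le \sqrt{\gamma}$. Since the value of the zero solution for program (11) is $l(0)$, it follows that

$$l(0) + \sqrt{\gamma} \ge \mathrm{Err}_{\mathcal{D},l}(g) = (1-\lambda)\,\mathrm{Err}_{\mu_e^1,l}(g) + \lambda\,\mathrm{Err}_{\mu_e^2,l}(g)$$

Thus, $\mathrm{Err}_{\mu_e^2,l}(g) \le \frac{l(0)+\sqrt{\gamma}}{\lambda} \le \frac{2l(0)}{\lambda}$. Combining Lemma 5.17 and Lemma 5.16 it follows that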

/ 9 _/ 9 <H|(^! + 1 o. /f 3,. £ . c . (r , + si) 

J {x:{x,e)=~i} J{x:(x,e)=-y} |C+H U JI A 

By choosing K = 0(log(C)), A = 6 (^K^) = 9 ( 7 log 3 - 5 (C)) and d = 9(log(C)), we can 
make the last bound < %. We claim that L , , , q > %. To see that, note that otherwise 

— 2 J {x:(a',e)=— 7} ^ 2 ' 

I{ X :(x,e)=j} 9<^ thus, 

E(x,„)~2jZ((/(x) + 6)y) = E {Xjy) ^ v l(g(x)y) 

> 8(1 - A) • I l(g(x))dx 

J {x:(x,e)= , y} 

> 8(1 - A) • I ( [ g(x)dx) 

> 8-l(a)-(l-X) = 8-l(a) + o(l) 

This contradict the optimality of /, b, as for /' = 0, b' — /3 it holds that 

E (x ^ v l((f(x) + b')y) < \l(—/3) + (1 - A) • (1 - 8)l(-(3) + 8 ■ I (ft)) 

= (l-8)l{-(3) + 8-l(/3) + o{l) 






We can now conclude the proof of Theorem 2.5. By choosing $d$ large enough and using Equation (12), we can guarantee that $g|_{\{x:\langle x,e\rangle=-\gamma\}}$ is very concentrated around its expectation. In particular, if $(x,y)$ are sampled according to $\mathcal{D}$, then w.p. $\ge 0.5\cdot(1-\theta)\cdot(1-\lambda) = \Omega(1)$, it holds that $yg(x) < 0$. Thus, $\mathrm{Err}_{\mathcal{D},0\text{-}1}(g) = \Omega(1)$, while $\mathrm{Err}_{\gamma}(\mathcal{D}) \le \lambda = O(\gamma\,\mathrm{poly}(\log(C)))$.

To conclude the proof of Theorem 3.2, we note that we can assume that $g$ is $O(e)$-invariant. Otherwise, we can replace it with $\mathcal{P}_e f + b$. This does not increase $\|f\|_{H_k}$ nor $\mathrm{Err}_{\mathcal{D},l}(f + b)$, thus the solution $\mathcal{P}_e f + b$ is optimal as well. Now, it follows that $g|_{\{x:\langle x,e\rangle=-\gamma\}}$ is constant and we finish as before.

□ 



5.4.2 Theorem 3.1 

Let $L$ be the Lipschitz constant of $l$. Let $\beta > \alpha > 0$ be such that $l(\alpha) > l(\beta)$. Choose $0 < \theta < 1$ large enough so that $(1-\theta)l(-\beta) + \theta l(\beta) < \theta l(\alpha)$. First, define probability measures $\mu^1, \mu^2, \mu^3$ and $\mu$ over $[-1,1]\times\{\pm 1\}$ as follows.

$$\mu^1(\gamma,1) = \theta,\quad \mu^1(-\gamma,-1) = 1-\theta$$

$$\mu^2(-\gamma,1) = 1$$

The measure $\mu^3$ is the product of $\mathrm{uniform}\{\pm 1\}$ and the measure over $[-1,1]$ whose density function is

$$w(x) = \begin{cases} 0 & |x| > \frac{1}{8}\\ \frac{8}{\pi\sqrt{1-(8x)^2}} & |x| \le \frac{1}{8}\end{cases}$$

Finally, $\mu = (1-\lambda_2-\lambda_3)\mu^1 + \lambda_2\mu^2 + \lambda_3\mu^3$ with $\lambda_2, \lambda_3 > 0$ to be chosen later.

Now, let $e \in S^{d-1}$ be the vector from Lemma 5.16. The distribution $\mathcal{D}$ is the pullback of $\mu$ w.r.t. $e$. By considering the affine functional $\Lambda_{e,0}$, it holds that $\mathrm{Err}_{\gamma}(\mathcal{D}) \le \lambda_3 + \lambda_2$.

Let $g$ be the solution returned by the algorithm. With probability $\ge 1 - \exp(-1/\gamma)$, $g = f + b$, where $f, b$ is a solution to program (11) with $C = C_{\mathcal{A}}(\gamma)$ and with an additive error $\le \sqrt{\gamma}$. As in the proof of Theorem 2.5, it holds that



J {x:(x,e)=7| J \x:{ 



9 

'{x:(x,e)=7} J {x:(x,e)=— 7} 



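Denote the last bound by $\epsilon$. It holds that

$$\mathrm{Err}_{\mathcal{D},l}(g) = (1-\lambda_2-\lambda_3)\,\mathbb{E}_{\mu_e^1}l(yg(x)) + \lambda_2\,\mathbb{E}_{\mu_e^2}l(yg(x)) + \lambda_3\,\mathbb{E}_{\mu_e^3}l(yg(x)) \qquad (14)$$

Now, denote $\delta = \int_{\{x:\langle x,e\rangle=-\gamma\}} g$. It holds that

$$\begin{aligned}
\mathbb{E}_{\mu_e^1}l(yg(x)) &= \theta\int_{\{x:\langle x,e\rangle=\gamma\}} l(g(x)) + (1-\theta)\int_{\{x:\langle x,e\rangle=-\gamma\}} l(-g(x))\\
&\ge \theta\cdot l\left(\int_{\{x:\langle x,e\rangle=\gamma\}} g\right) + (1-\theta)\cdot l\left(-\int_{\{x:\langle x,e\rangle=-\gamma\}} g\right) \qquad (15)\\
&\ge \theta\cdot l(\delta) + (1-\theta)\cdot l(-\delta) - L\epsilon
\end{aligned}$$

Thus,

$$\mathrm{Err}_{\mathcal{D},l}(g) \ge (1-\lambda_2-\lambda_3)(\theta\cdot l(\delta) + (1-\theta)\cdot l(-\delta)) - L\epsilon + \lambda_2\,\mathbb{E}_{\mu_e^2}l(yg(x))$$

However, by considering the constant solution $\delta$, it follows that

$$\begin{aligned}
\mathrm{Err}_{\mathcal{D},l}(g) &\le (1-\lambda_2-\lambda_3)(\theta l(\delta) + (1-\theta)\cdot l(-\delta)) + \lambda_2\cdot l(\delta) + \lambda_3\cdot\frac{1}{2}(l(\delta) + l(-\delta)) + \sqrt{\gamma}\\
&\le (1-\lambda_2-\lambda_3)(\theta\cdot l(\delta) + (1-\theta)\cdot l(-\delta)) + \lambda_2\cdot l(\delta) + \lambda_3\cdot l(-|\delta|) + \sqrt{\gamma}
\end{aligned}$$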

Thus, 

Err^) < y 2 +l ^ + Y 2 l{ ~^ )Jr ^ (16) 
- /(0)128 ^ L + 1Q - L / K3 ' 5 ■ E ■ C ■ (r K + s d ) + «*) + ^(-1*1) + ^ 



|<9+/(0)|A 2 A 3 A 2 v A 2 A 



2 



Now, relying on the assumption that 7 • log 8 (C) = o(l), it is possible to choose A 2 = 
(^7^ 4 ) = © (v/7log 4 (C)), A 3 = V7, # = e(log(C/ 7 )), and d = 6(log(C/ 7 )) such that 
the bound in Equation (13), L ffgff 3 5 + ■ E-C ■ (r K + A 2 , A 3 and ^ are all o(l). 

Since the bound in Equation (13) is o(l), it follows, as in the proof of Theorem 2.5, that 
1(3) — I (f ) an d consequently, < f < 5. From equations (14) and (15), it follows that 

£ e 1 Errp.i(g) £ £ 1 21(0) 

l(-\5\) = l(-5) < j[ ^- A3 < j[ = 0(1) 

It now follows from Equation (16) that 

E {X)y) ^ fi J(g(x)y) = Err^g) < I + o(l) 

By Markov's inequality, 

Pr (l(g(x)y) > 1(0)) < 1 (f } * 

1(0)— 1( — ) 

Thus, if (x,y) are chosen according to /ig, then w.p. > — ^ v 2 ; — o(l), l(g(x)) < 1(0) =>- 
g(x) > 0. Since the marginal distributions of n\ and fi 2 e are the same, it follows that, if (x, y) 

are chosen according to T>, then w.p. > ( ^~t?w 2 — o(l) J ■ (1 — A 2 — A3) ■ (1 — 6>) = 0(1), 



□ 



Z(0) 

^(x) < 0. Thus, Err^ -i(s) = 0(1) while" Err 7 (P) < A 2 + A 3 = O (V7Poly(log(C))). 
5.4.3 The integrality gap - Theorem 3.3

Our first step is a reduction to the hinge loss. Let $a = \partial_+ l(0)$. Define

$$l^*(x) = \begin{cases} 1 + ax & x \le -\frac{1}{a}\\ 0 & \text{otherwise}\end{cases}$$

It is not hard to see that $l^*$ is a convex surrogate satisfying $\forall x,\ l^*(x) \le l(x)$ and $\partial_+ l^*(0) = \partial_+ l(0)$. Thus, if we substitute $l$ with $l^*$, we just decrease the integrality gap, hence we can assume that $l = l^*$. Now, we note that if we consider program (11) with $l = l^*$, the integrality gap coincides with what we get by replacing $C$ with $|a|\cdot C$ and $l^*$ with the hinge loss. To see that, note that for every $f \in H_k$, $b \in \mathbb{R}$, $\mathrm{Err}_{\mathcal{D},l^*}(f + b) = \mathrm{Err}_{\mathcal{D},\mathrm{hinge}}(|a|\cdot f + |a|\cdot b)$, thus minimizing $\mathrm{Err}_{\mathcal{D},l^*}$ over all functions $f \in H_k$ that satisfy $\|f\|_{H_k} \le C$ is equivalent to minimizing $\mathrm{Err}_{\mathcal{D},\mathrm{hinge}}$ over all functions $f \in H_k$ that satisfy $\|f\|_{H_k} \le |a|\cdot C$. Thus, it is enough to prove the Theorem for $l = l_{\mathrm{hinge}}$.
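The reduction to the hinge loss can be illustrated on one concrete surrogate. The sketch below assumes the base-2 logistic loss $l(x) = \log_2(1 + e^{-x})$ as the surrogate (an illustrative choice, not one fixed by the text), reads $l^*$ as the truncated tangent line $\max(0, 1 + ax)$ with $a = \partial_+ l(0)$, and checks on a grid that $l^* \le l$ and that $l^*(x) = l_{\mathrm{hinge}}(|a|x)$. The function names are hypothetical.

```python
import math

def logistic_base2(x):
    """l(x) = log2(1 + exp(-x)): convex, l(0) = 1, and l(x) >= 1 for x <= 0."""
    return math.log2(1.0 + math.exp(-x))

# Right derivative of logistic_base2 at 0 (the loss is differentiable there).
A = -1.0 / (2.0 * math.log(2.0))

def hinge(x):
    return max(0.0, 1.0 - x)

def l_star(x, a=A):
    """Truncated tangent of the surrogate at 0: 1 + a x for x <= -1/a, else 0 (a < 0)."""
    return 1.0 + a * x if x <= -1.0 / a else 0.0

grid = [i / 10.0 for i in range(-100, 101)]
dominated = all(l_star(x) <= logistic_base2(x) + 1e-12 for x in grid)
rescaled_hinge = all(abs(l_star(x) - hinge(abs(A) * x)) < 1e-12 for x in grid)
```

Because $l^*$ is the hinge loss composed with the scaling $x \mapsto |a|x$, a norm budget of $C$ for $l^*$ corresponds to a budget of $|a|\cdot C$ for the hinge loss, which is the substitution used in the text.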

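Next, we show that we can assume that the embedding is symmetric (i.e., corresponds to a symmetric kernel). As the integrality gap is at least as large as the approximation ratio, using Theorem 3.2 this will complete our argument. (The reduction to the hinge loss yields bounds with universal constants in the asymptotic terms.)

Let $\gamma > 0$ and let $\mathcal{D}$ be a distribution on $S^{d-1}\times\{\pm 1\}$. It is enough to find a (possibly different) distribution $\mathcal{D}_1$ with the same $\gamma$-margin error as $\mathcal{D}$, for which the optimum of program (11) (with $l = l_{\mathrm{hinge}}$) is not smaller than the optimum of the program

$$\begin{aligned}
\min\quad & \mathrm{Err}_{\mathcal{D},\mathrm{hinge}}(f + b)\\
\text{s.t.}\quad & f \in H_{k_s},\ b \in \mathbb{R} \qquad\qquad (17)\\
& \|f\|_{H_{k_s}} \le C
\end{aligned}$$

Denote the optimal value of program (17) by $\alpha$ and assume, towards contradiction, that whenever $\mathrm{Err}_{\gamma}(\mathcal{D}_1) = \mathrm{Err}_{\gamma}(\mathcal{D})$, the optimum of program (11) is strictly less than $\alpha$.

For every $A \in O(d)$, let $\mathcal{D}_A$ be the distribution of the r.v. $(Ax,y) \in S^{d-1}\times\{\pm 1\}$, where $(x,y) \sim \mathcal{D}$. Since clearly $\mathrm{Err}_{\gamma}(\mathcal{D}_A) = \mathrm{Err}_{\gamma}(\mathcal{D})$, there exist $f_A \in H_k$ and $b_A \in \mathbb{R}$ such that $\|f_A\|_{H_k} \le C$ and $\mathrm{Err}_{\mathcal{D}_A,\mathrm{hinge}}(g_A) < \alpha$, where $g_A := f_A + b_A$. Define $f \in H_{k_s}$ by $f(x) = \int_{O(d)} f_A(Ax)\,dA$ and let $b = \int_{O(d)} b_A\,dA$ and $g = f + b$. By Theorem 5.12, $\|f\|_{H_{k_s}} \le C$. Finally, for $l = l_{\mathrm{hinge}}$,

$$\begin{aligned}
\mathrm{Err}_{\mathcal{D},\mathrm{hinge}}(g) &= \mathbb{E}_{(x,y)\sim\mathcal{D}}\,l(yg(x))\\
&= \mathbb{E}_{(x,y)\sim\mathcal{D}}\,l\left(y\,\mathbb{E}_{A\sim O(d)}\,g_A(Ax)\right)\\
&\le \mathbb{E}_{(x,y)\sim\mathcal{D}}\,\mathbb{E}_{A\sim O(d)}\,l(yg_A(Ax))\\
&= \mathbb{E}_{A\sim O(d)}\,\mathbb{E}_{(x,y)\sim\mathcal{D}}\,l(yg_A(Ax))\\
&= \mathbb{E}_{A\sim O(d)}\,\mathbb{E}_{(x,y)\sim\mathcal{D}_A}\,l(yg_A(x)) < \alpha
\end{aligned}$$

Contrary to the assumption that $\alpha$ is the optimum of program (17).

5.4.4 Finite dimension - Theorems 2.6 and 3.4

Let $V \subset C(S^{d-1})$ be the linear space $\{\Lambda_{v,b}\circ\psi : v \in \mathbb{R}^m,\ b \in \mathbb{R}\}$ and denote $\mathcal{W} = \{\Lambda_{v,b}\circ\psi : v \in W,\ b \in \mathbb{R}\}$. We note that $\dim(V) \le m+1$. Instead of program (4) we consider the equivalent formulation

$$\begin{aligned}
\min\quad & \mathrm{Err}_{\mathcal{D},l}(f)\\
\text{s.t.}\quad & f \in \mathcal{W}
\end{aligned}$$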



Lemma 5.19 (John's Lemma) (Matousek, 2002) Let $V$ be an $m$-dimensional real vector space and let $K$ be a full-dimensional $0$-symmetric compact convex set. Then there exists an inner product on $V$ so that $K$ is contained in the unit ball, and contains the ball around $0$ of radius $\frac{1}{\sqrt{m}}$.
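
As a concrete instance of the lemma (our own illustration, not taken from the source), consider $V = \mathbb{R}^m$ with $K = [-1,1]^m$ and the inner product $\langle x, y \rangle = \frac{1}{m}\sum_i x_i y_i$. Every point of the cube then has norm at most 1, and every point of norm at most $1/\sqrt{m}$ has Euclidean norm at most 1 and so lies in the cube. The sketch below checks both inclusions on random points; all function names are hypothetical.

```python
import math
import random

def scaled_norm(x):
    """Norm induced by the inner product <x, y> = (1/m) * sum_i x_i * y_i."""
    return math.sqrt(sum(t * t for t in x) / len(x))

def in_cube(x, tol=1e-12):
    """Membership in K = [-1, 1]^m, up to a small numerical tolerance."""
    return all(abs(t) <= 1.0 + tol for t in x)

def check_john_cube(m, trials=2000, seed=0):
    """Check K inside the unit ball, and the ball of radius 1/sqrt(m) inside K."""
    rng = random.Random(seed)
    radius = 1.0 / math.sqrt(m)
    for _ in range(trials):
        # A random point of the cube must have scaled norm at most 1.
        p = [rng.uniform(-1.0, 1.0) for _ in range(m)]
        if scaled_norm(p) > 1.0 + 1e-12:
            return False
        # A random point of scaled norm at most 1/sqrt(m) must lie in the cube.
        g = [rng.gauss(0.0, 1.0) for _ in range(m)]
        rho = radius * rng.random()
        s = scaled_norm(g)
        q = [rho * t / s for t in g]
        if not in_cube(q):
            return False
    return True
```

Both inclusions are tight for the cube: a vertex has norm exactly 1, and a unit coordinate vector has norm exactly $1/\sqrt{m}$ while sitting on the boundary of $K$.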

Lemma 5.20 Let $l$ be a convex surrogate and let $V \subset C(S^{d-1})$ be an $m$-dimensional vector space. There exists a continuous kernel $k : S^{d-1} \times S^{d-1} \to \mathbb{R}$ with $\sup_{x \in S^{d-1}} k(x,x) \le 1$ such that $H_k = V$ as a vector space and there exists a probability measure $\mu_N$ such that

$$\forall f \in V, \quad \|f\|_{H_k} \le \frac{2m^{1.5}}{|\partial_+ l(0)|}\, \mathrm{Err}_{\mu_N,l}(f)$$

Proof Let $\psi : S^{d-1} \to V^*$ be the evaluation operator. It maps each $x \in S^{d-1}$ to the linear functional $f \in V \mapsto f(x)$. We claim that

1. $\psi$ is continuous,

2. $\mathrm{aff}(\psi(S^{d-1}) \cup -\psi(S^{d-1})) = V^*$,

3. $V = \{v^{**} \circ \psi : v^{**} \in V^{**}\}$.

Proof of 1: We need to show that $\psi(x_n) \to \psi(x)$ if $x_n \to x$. Since $V^*$ is finite dimensional, it suffices to show that $\psi(x_n)(f) \to \psi(x)(f)$ for every $f \in V$, which follows from the continuity of $f$.

Proof of 2: Note that $0 \in U = \mathrm{aff}(\psi(S^{d-1}) \cup -\psi(S^{d-1}))$, so $U$ is a linear subspace of $V^*$. Now, define $T : U^* \to V$ via $T(u^*) = u^* \circ \psi$. We claim that $T$ is onto, whence $\dim(U) = \dim(U^*) \ge \dim(V) = \dim(V^*)$, so that $U = V^*$. Indeed, for $f \in V$, let $u_f^* \in U^*$ be the functional $u_f^*(u) = u(f)$. Now, $T(u_f^*)(x) = u_f^*(\psi(x)) = \psi(x)(f) = f(x)$, thus $T(u_f^*) = f$.

Proof of 3: From $U = V^*$ it follows that $U^* = V^{**}$, so that the mapping $T : V^{**} \to V$ is onto, showing that $V = \{v^{**} \circ \psi : v^{**} \in V^{**}\}$.

Let us apply John's Lemma to $K = \mathrm{conv}(\psi(S^{d-1}) \cup -\psi(S^{d-1}))$. It yields an inner product on $V^*$ with $K$ contained in the unit ball and containing the ball around $0$ with radius $\frac{1}{\sqrt{m}}$. Let $k$ be the kernel $k(x,y) = \langle \psi(x), \psi(y) \rangle$. Since $\psi$ is continuous, $k$ is continuous as well. By Theorem 5.1.1 and since $T$ is onto, it follows that, as a vector space, $V = H_k$. Since $K$ is contained in the unit ball, it follows that $\sup_{x \in S^{d-1}} k(x,x) \le 1$. It remains to prove the existence of the measure $\mu_N$.
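
To make the kernel construction concrete, here is a toy sketch under assumptions of our own: $V = \mathrm{span}\{1, x_1, x_2\}$ on the circle $S^1$, the evaluation functional $\psi(x)$ written in coordinates as $(1, x_1, x_2)$, and a hand-picked inner product $\langle u, w \rangle = \frac{1}{2}\, u \cdot w$ on $V^*$. This inner product is chosen so that $k(x,x) = 1$; it is not claimed to be the one John's Lemma produces.

```python
import math

def psi(x):
    """Coordinates of the evaluation functional at the point x = (x1, x2)."""
    return (1.0, x[0], x[1])

def inner(u, w):
    """Hypothetical inner product on V*: the dot product scaled by 1/2."""
    return sum(a * b for a, b in zip(u, w)) / 2.0

def kernel(x, y):
    """k(x, y) = <psi(x), psi(y)>."""
    return inner(psi(x), psi(y))

def as_function(v):
    """The element f = <v, psi(.)> of V represented by the vector v."""
    return lambda x: inner(v, psi(x))

def circle_point(t):
    """Point of S^1 at angle t."""
    return (math.cos(t), math.sin(t))

# f(x) = 3 + 2*x1 - x2 is represented by v = 2 * (3, 2, -1),
# because the inner product carries a factor of 1/2.
f = as_function((6.0, 4.0, -2.0))
```

Every function in this $V$ arises as $x \mapsto \langle v, \psi(x) \rangle$ for some $v$, which is the sense in which $V = H_k$ as a vector space.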

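Let $e_1, \ldots, e_m \in V^*$ be an orthonormal basis. For every $i \in [m]$, choose $(x_i^1, y_i^1), \ldots, (x_i^{m+1}, y_i^{m+1}) \in S^{d-1} \times \{\pm 1\}$ and $\lambda_i^1, \ldots, \lambda_i^{m+1} \ge 0$ such that $\sum_{j=1}^{m+1} \lambda_i^j = 1$ and $\frac{1}{\sqrt{m}} e_i = \sum_{j=1}^{m+1} \lambda_i^j y_i^j \psi(x_i^j)$. Define $\mu_N(x_i^j, 1) = \mu_N(x_i^j, -1) = \frac{\lambda_i^j}{2m}$.

Let $f \in V$. By Theorem 5.1.1 there exists $v \in V^*$ such that $f = \Lambda_{v,0} \circ \psi$ and $\|f\|_{H_k} = \|v\|_{V^*}$. It follows that, for $a = \partial_+ l(0)$,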



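$$\begin{aligned}
\mathrm{Err}_{\mu_N,l}(f) &= \sum_{i=1}^{m} \sum_{j=1}^{m+1} \frac{\lambda_i^j}{2m} \left[ l\big(y_i^j f(x_i^j)\big) + l\big(-y_i^j f(x_i^j)\big) \right] \\
&\ge \frac{1}{2m} \sum_{i=1}^{m} \left[ l\Big( \sum_{j=1}^{m+1} \lambda_i^j y_i^j f(x_i^j) \Big) + l\Big( -\sum_{j=1}^{m+1} \lambda_i^j y_i^j f(x_i^j) \Big) \right] \\
&= \frac{1}{2m} \sum_{i=1}^{m} \left[ l\Big( \frac{\langle v, e_i \rangle}{\sqrt{m}} \Big) + l\Big( -\frac{\langle v, e_i \rangle}{\sqrt{m}} \Big) \right] \\
&\ge \frac{|a|}{2m^{1.5}} \sum_{i=1}^{m} |\langle v, e_i \rangle| \\
&\ge \frac{|a|}{2m^{1.5}} \|v\|_{V^*} = \frac{|a|}{2m^{1.5}} \|f\|_{H_k}
\end{aligned}$$

Here the first inequality is Jensen's inequality, the second uses $l(z) + l(-z) \ge l(-|z|) \ge l(0) + |a| \cdot |z| \ge |a| \cdot |z|$, and the last uses $\sum_{i} |\langle v, e_i \rangle| \ge \|v\|_{V^*}$.

□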



Proof (of Theorem 2.6) Let $L$ be the Lipschitz constant of $l$. Let $\beta > \alpha > 0$ such that $l(\alpha) > l(\beta)$. Choose $0 < \theta < 1$ large enough so that $(1-\theta)\, l(-\beta) + \theta\, l(\beta) < \theta\, l(\alpha)$. First, define probability measures $\mu^1, \mu^2, \mu^3$ and $\mu$ over $[-1,1] \times \{\pm 1\}$ as follows.

$$\mu^1(\gamma, 1) = \theta, \quad \mu^1(-\gamma, -1) = 1 - \theta$$
$$\mu^2(-\gamma, 1) = 1$$

The measure $\mu^3$ is the product of uniform$\{\pm 1\}$ and the measure over $[-1,1]$ whose density function is

\x\ > | 

i 



w(x) 



\x\ > 
\x\ < 



Let $k$, $\mu_N$ be the kernel and the distribution from Lemma 5.20. Now, let $e \in S^{d-1}$ be the vector from Lemma 5.16. We define the distribution $\mathcal{D}$ corresponding to the measure

$$\mu = (1 - \lambda_2 - \lambda_3 - \lambda_N)\,\mu^1_e + \lambda_2\,\mu^2_e + \lambda_3\,\mu^3_e + \lambda_N\,\mu_N$$

By considering the affine functional $\Lambda_{e,0}$, it holds that $\mathrm{Err}_\gamma(\mathcal{D}) \le \lambda_3 + \lambda_2 + \lambda_N$.

Let $g$ be the solution returned by the algorithm. With probability $\ge 1 - \exp(-1/\gamma)$, $g = f + b$, where $f, b$ is a solution to program (18) with an additive error $\le \sqrt{\gamma}$. Denote $\|g\|_{H_k} = C$. By Lemma 5.20, it holds that

2m 1 - 5 

2m 1 - 5 Err^i(g) 



< 



\d+l(0)\ Xn 
2m 1 - 5 1(0) 



\d+l(0)\ Xn 

As in the proof of Theorem 2.5, it holds that 



/ '-J, 

J \x:(x,e)=~i\ J \x 



9 

'{x:(a;,e)=7} J{x:(x,e)=— 7} 
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"■'(/ 
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+ (1-0)- 1(- [ 




\J {x:{x,e}- 


=7} / 


V </ {x:{x,e) 
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•*(<*) + (1- 


-0)-Z( 


-6) - Le 


> (1-A2 


— A3 — Aat)(0 ■ 


J(*) + 


(l-0)-Z(-5))-Le + 



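Denote the last bound by $\epsilon$. It holds that

$$\mathrm{Err}_{\mathcal{D},l}(g) = (1 - \lambda_2 - \lambda_3 - \lambda_N)\,\mathbb{E}_{\mu^1_e}\, l(y g(x)) + \lambda_2\,\mathbb{E}_{\mu^2_e}\, l(y g(x)) + \lambda_3\,\mathbb{E}_{\mu^3_e}\, l(y g(x)) + \lambda_N\,\mathbb{E}_{\mu_N}\, l(y g(x)) \tag{20}$$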

Now, denote 6 = fj x ., x e }=_ 7 } 9- ^ holds that 

E M i/(^(x)) = /" + (l - 0) f K-g(x)) 

J {x:(x,e)=-y} J {x:(x,e)=— 7} 

ff) (21) 



Thus, 



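However, by considering the constant solution $\delta$, it follows that

$$\begin{aligned}
\mathrm{Err}_{\mathcal{D},l}(g) &\le (1 - \lambda_2 - \lambda_3 - \lambda_N)\big(\theta \cdot l(\delta) + (1-\theta) \cdot l(-\delta)\big) + \lambda_2 \cdot l(\delta) + (\lambda_3 + \lambda_N)\,\tfrac{1}{2}\big(l(\delta) + l(-\delta)\big) + \sqrt{\gamma} \\
&\le (1 - \lambda_2 - \lambda_3 - \lambda_N)\big(\theta \cdot l(\delta) + (1-\theta) \cdot l(-\delta)\big) + \lambda_2 \cdot l(\delta) + (\lambda_3 + \lambda_N) \cdot l(-|\delta|) + \sqrt{\gamma}
\end{aligned}$$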

Thus, 

En„ 3J ( 9 ) < y + HS) + X -^l(-\S\) + f (22) 

Now, relying on the assumption that 7 • log 8 (C) = o(l), it is possible to choose A 2 = 
6 (V7^ 4 ) = © (v/7log 4 (C)), A 3 = V7, # = e(log(C/ 7 )), X N = ia ndd = 0(log(C/ 7 )) 




such that the bound in Equation (19), ^ffimjjp|~^ + • E • C • {r K + s d ), A2, A3, \n and 
*5±^g±v5 are all o(l). 

Since the bound in Equation (19) is o(l), it follows, as in the proof of Theorem 2.5, that 
1(8) < I (f ) and consequently, < f < 5. From equations (20) and (21), it follows that 

Le Err Pji (g) ^ 2/(0) 
Z(-|5|) = l(-S) < l-A^As-A^ < l-\ 2 -\s-\ N = Q ^ 

1 — 1 — 9 

It now follows from Equation (22) that 

^{x, y )~nAg{ x )y) = Err ^(#) < 1 (|) + 

By Markov's inequality, 

Pr (%(z)y) > Z(0)) < j 

1(0)— 1( — ) 

Thus, if are chosen according to /x^, then w.p. > — ^ v 2 7 — o(l), l(g(x)) < 1(0) =>- 

g(x) > 0. Since the marginal distributions of n\ and /i^ are the same, it follows that, if (x, y) 

(i(o)—i( -) \ 
— ^ 2 ' — o(l) I -(1 — A2 — A3 — Ajv)*(1 — 9) =0(1), 

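$y g(x) < 0$. Thus, $\mathrm{Err}_{\mathcal{D},0-1}(g) = \Omega(1)$ while $\mathrm{Err}_\gamma(\mathcal{D}) \le \lambda_2 + \lambda_3 + \lambda_N = O\left(\sqrt{\gamma}\,\mathrm{poly}(\log(C))\right) = O\left(\sqrt{\gamma}\,\mathrm{poly}(\log(m/\gamma))\right)$.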

□ 



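Proof (of Theorem 3.4) As in the proof of Theorem 3.3, we can assume w.l.o.g. that $l = l_{\mathrm{hinge}}$. Let $k$, $\mu_N$ be the kernel and the measure from Lemma 5.20. Let $C = 2m^{1.5}/\gamma$.

By (the proof of) Theorem 3.3, there exists a probability measure $\tilde{\mu}$ over $S^{d-1} \times \{\pm 1\}$ such that for every $f \in H_k$ with $\|f\|_{H_k} \le C$ it holds that $\mathrm{Err}_{\tilde{\mu},0-1}(f) = \Omega(1)$ but $\mathrm{Err}_\gamma(\tilde{\mu}) = O(\gamma \cdot \mathrm{poly}(\log(C)))$. Consider the distribution $\mu = (1 - \gamma)\tilde{\mu} + \gamma \mu_N$. It still holds that $\mathrm{Err}_\gamma(\mu) = O(\gamma \cdot \mathrm{poly}(\log(C))) = O(\gamma \cdot \mathrm{poly}(\log(m/\gamma)))$. Let $f$ be an optimal solution for program (18). We have that $1 \ge \mathrm{Err}_{\mu,\mathrm{hinge}}(f) \ge \gamma \cdot \mathrm{Err}_{\mu_N,\mathrm{hinge}}(f)$. By Lemma 5.20, $\|f\|_{H_k} \le C$. Thus, $\mathrm{Err}_{\mu,0-1}(f) \ge (1 - \gamma)\,\mathrm{Err}_{\tilde{\mu},0-1}(f) = \Omega(1)$.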

□ 
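
The arithmetic behind the choice $C = 2m^{1.5}/\gamma$ in the proof of Theorem 3.4 can be traced in a few lines. This is a sketch under the proof's own assumptions (hinge loss, so $|\partial_+ l(0)| = 1$, and a mixture placing weight $\gamma$ on $\mu_N$); the function names are hypothetical.

```python
import math

def norm_bound(m, gamma, hinge_err_mixture=1.0):
    """Upper bound on ||f||_{H_k} obtained from Lemma 5.20 for the hinge loss.

    The mixture puts weight gamma on mu_N, so
    Err_{mu_N}(f) <= Err_{mu}(f) / gamma, and the lemma gives
    ||f|| <= 2 * m**1.5 * Err_{mu_N}(f).
    """
    err_mu_n = hinge_err_mixture / gamma
    return 2.0 * m ** 1.5 * err_mu_n

def log_ratio(m, gamma):
    """log(C) / log(m / gamma) for C = 2 * m**1.5 / gamma."""
    return math.log(norm_bound(m, gamma)) / math.log(m / gamma)
```

For $m \ge 2$ and $\gamma \le 1$ the ratio returned by `log_ratio` is at most 2.5, which is why $\mathrm{poly}(\log(C))$ can be rewritten as $\mathrm{poly}(\log(m/\gamma))$.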



6 Choosing a surrogate according to the margin. 

The purpose of this section is to demonstrate the subtleties relating to the possibility of choosing a convex surrogate $l$ according to the margin $\gamma$. Let $k : B \times B \to \mathbb{R}$ be the kernel

$$k(x,y) = \frac{1}{1 - \frac{1}{2}\langle x, y \rangle_H}$$
and let $\psi : B \to H_1$ be a corresponding embedding (i.e., $k(x,y) = \langle \psi(x), \psi(y) \rangle_{H_1}$). In (Shalev-Shwartz et al., 2011) it has been shown that the solution $f, b$ to Program (2), with $C = C(\gamma) = \mathrm{poly}(\exp(1/\gamma \cdot \log(1/\gamma)))$ and the embedding $\psi$, satisfies

$$\mathrm{Err}_{\mathcal{D},\mathrm{hinge}}(f + b) \le \mathrm{Err}_\gamma(\mathcal{D}) + \gamma.$$

Consequently, every approximate solution to the Program with an additive error of at most $\gamma$ will have a 0-1 loss bounded by $\mathrm{Err}_\gamma(\mathcal{D}) + 2\gamma$.

For every $\gamma$, define a 1-Lipschitz convex surrogate by

$$l_\gamma(x) = \begin{cases} 1 - x & x \le 1/C(\gamma) \\ 1 - 1/C(\gamma) & x \ge 1/C(\gamma) \end{cases}$$

Claim 1 A function $g : B \to \mathbb{R}$ is a solution to Program (5) with $l = l_\gamma$, $C = 1$ and the embedding $\psi$, if and only if $C(\gamma) \cdot g$ is a solution to Program (2) with $C = C(\gamma)$ and the embedding $\psi$.

We postpone the proof to the end of the section. We note that Program (5) with $l = l_\gamma$, $C = 1$ and the embedding $\psi$ has a complexity of 1, according to our conventions. Moreover, by Claim 1, the optimal solution to it has a 0-1 error of at most $\mathrm{Err}_\gamma(\mathcal{D}) + \gamma$. Thus, if $A$ is an algorithm that is only obligated to return an approximate solution to Program (5) with $l = l_\gamma$, $C = 1$ and the embedding $\psi$, we cannot lower bound its approximation ratio. In particular, our Theorems regarding the approximation ratio are no longer true, as currently stated, if the algorithms are allowed to choose the surrogate according to $\gamma$. One might be tempted to think that by the above construction (i.e., taking $\psi$ as our embedding, choosing $C = 1$ and $l = l_\gamma$, and approximating the program upon a sample of size $\mathrm{poly}(1/\gamma)$), we have actually given a 1-approximation algorithm. The crux of the matter is that algorithms that approximate the program according to a finite sample of size $\mathrm{poly}(1/\gamma)$ are only guaranteed to find a solution with an additive error of $\mathrm{poly}(\gamma)$. For the loss $l_\gamma$, such an additive error is meaningless: since for every function $f$, $\mathrm{Err}_{\mathcal{D},l_\gamma}(f) \ge 1 - 1/C(\gamma)$, the $0$ solution has an additive error of $\mathrm{poly}(\gamma)$. Therefore, we cannot argue that the solution returned by the algorithm will have a small 0-1 error. Indeed, we anticipate that the algorithm we have described will suffer from serious over-fitting.

To summarize, we note that the lower bounds we have proved rely on the fact that the optimal solutions of the programs we considered are very bad. For the algorithm we sketched above, the optimal solution is very good. However, guarantees on approximate solutions obtained from a polynomial sample are meaningless. We conclude that lower bounds for such algorithms will have to involve over-fitting arguments, which are out of the scope of the paper.

Proof (of claim 1) Define

$$l^*(x) = \begin{cases} 1 - C(\gamma) \cdot x & x \le 1/C(\gamma) \\ 0 & x \ge 1/C(\gamma) \end{cases}$$

Since $l^*(x) = C(\gamma) \cdot \big(l_\gamma(x) - (1 - 1/C(\gamma))\big)$, it follows that the solutions to Program (5) with $l = l^*$, $C = 1$ and $\psi$ coincide with the solutions with $l = l_\gamma$, $C = 1$ and $\psi$. Now, we note that, for every function $f : B \to \mathbb{R}$,

$$\mathrm{Err}_{\mathcal{D},l^*}(f) = \mathrm{Err}_{\mathcal{D},\mathrm{hinge}}(C(\gamma) \cdot f)$$

Thus, $w, b$ minimizes $\mathrm{Err}_{\mathcal{D},l^*}(\Lambda_{w,b} \circ \psi)$ under the restriction that $\|w\| \le 1$ if and only if $C(\gamma) \cdot w, C(\gamma) \cdot b$ minimizes $\mathrm{Err}_{\mathcal{D},\mathrm{hinge}}(\Lambda_{w,b} \circ \psi)$ under the restriction that $\|w\| \le C(\gamma)$.

□ 
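
The rescaling in the proof of Claim 1, and the reason a $\mathrm{poly}(\gamma)$ additive error says nothing about $l_\gamma$, can both be seen numerically. The sketch below assumes the form $l_\gamma(x) = \max(1 - x,\ 1 - 1/C(\gamma))$, which is the form consistent with the identity $l^*(x) = C(\gamma) \cdot (l_\gamma(x) - (1 - 1/C(\gamma)))$; the numeric values of `C` are stand-ins chosen for illustration, since $C(\gamma)$ itself is exponentially large in $1/\gamma$.

```python
def hinge(z):
    """Hinge loss at margin value z."""
    return max(0.0, 1.0 - z)

def l_gamma(z, C):
    """Assumed surrogate: 1 - z up to z = 1/C, then the constant 1 - 1/C."""
    return max(1.0 - z, 1.0 - 1.0 / C)

def l_star(z, C):
    """Shifted and rescaled surrogate from the proof of Claim 1."""
    return C * (l_gamma(z, C) - (1.0 - 1.0 / C))

def max_identity_gap(C, grid):
    """Largest deviation between l_star(z) and hinge(C * z) over a grid."""
    return max(abs(l_star(z, C) - hinge(C * z)) for z in grid)

def loss_window(C):
    """Gap between the loss of the zero predictor and the floor of l_gamma."""
    return l_gamma(0.0, C) - (1.0 - 1.0 / C)
```

With `C = 1e6` the window between the zero predictor and the best achievable $l_\gamma$ loss is about `1e-6`, so any solver accurate only to, say, `1e-2` cannot distinguish the optimum from the zero function.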